João Gonçalves

Two Features, One Contract

joaofogoncalves — Sun, 28 Jun 2026 00:00:00 GMT

BridgePort 3.0 has two headline features: a Terraform provider and an MCP server. They were built as separate efforts, in separate epics, and they look unrelated. One is infrastructure-as-code. The other is what people usually mean by “adding AI” to a tool.

They turned out to be two clients of the same thing: BridgePort’s HTTP API. The Terraform provider drives it from a file in version control, and the MCP server drives it from an agent in an editor. They expose different slices of it, one declaring configuration and the other running operations, but neither had to build its own validation, permissions, or audit trail. Both inherited that from the API. Most of the release didn’t go into either feature. It went into making the API stable enough to be that dependency.

The release before this one, 2.0, was about speed. The slowest production transaction dropped from a p99 of over 8 seconds to 46 milliseconds. 3.0 is about the API surface instead. Less visible, same general idea: make the control plane something other software can rely on.

What “hardening the API” meant

The phrase is vague until you list the work, and most of it is the kind that doesn’t show up in a demo.

The spec came first. BridgePort was already generating /openapi.json dynamically from its registered routes, but the document was nearly empty of contracts. Of roughly 251 route definitions, only three declared any schema, and all three were response-only, added for serialization speed. The real request contracts already existed as Zod schemas (createServerSchema, createServiceSchema, and the rest) used to validate incoming bodies. They just never reached the spec. 3.0 fed them in, converting each one with Zod 4’s built-in z.toJSONSchema() and assembling the document with @fastify/swagger, so the schema that validates a request is now the schema that documents it. One definition, two jobs. A pnpm run openapi:dump script writes a committed openapi.json snapshot, a CI check fails the build when routes change without regenerating it, and a test asserts that a minimum share of operations carry full schemas. The threshold only ratchets up, so a new route added without a schema can’t quietly lower the bar. It just fails the check until someone gives the route a schema.
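The ratchet is easy to picture in code. What follows is a minimal Python sketch, not BridgePort's actual test (which presumably lives in the TypeScript codebase). The names schema_coverage, check_ratchet, MIN_COVERAGE, and the sample document are all hypothetical, and "full schema" is assumed here to mean a typed request body wherever a body is declared, plus at least one typed response.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}


def has_schema(node):
    """True if any media type under node["content"] declares a schema."""
    content = (node or {}).get("content", {})
    return any("schema" in media for media in content.values())


def schema_coverage(spec):
    """Share of operations whose request (if any) and a response are typed."""
    total = typed = 0
    for path_item in spec.get("paths", {}).values():
        for method, op in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            total += 1
            body_ok = "requestBody" not in op or has_schema(op["requestBody"])
            resp_ok = any(has_schema(r) for r in op.get("responses", {}).values())
            if body_ok and resp_ok:
                typed += 1
    return typed / total if total else 1.0


MIN_COVERAGE = 0.5  # hypothetical floor; the only allowed edit is raising it


def check_ratchet(spec, floor=MIN_COVERAGE):
    """Fail loudly when coverage drops below the committed floor."""
    coverage = schema_coverage(spec)
    if coverage < floor:
        raise AssertionError(f"schema coverage {coverage:.0%} is below {floor:.0%}")
    return coverage


JSON_OBJECT = {"content": {"application/json": {"schema": {"type": "object"}}}}

# Made-up three-operation document: two typed, one untyped.
SAMPLE_SPEC = {
    "paths": {
        "/api/servers": {
            "get": {"responses": {"200": JSON_OBJECT}},
            "post": {"requestBody": JSON_OBJECT, "responses": {"201": JSON_OBJECT}},
        },
        "/api/health": {"get": {"responses": {"200": {"description": "ok"}}}},
    }
}
```

Two of the three sample operations are typed, so the check passes at a floor of 0.5 and fails at 0.9. An untyped new route lowers the ratio, which is what trips the check.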

Errors got one shape at the same time. Every non-2xx response now returns {code, message, field?, hint?, requestId?} with a documented set of codes (VALIDATION_ERROR, READONLY_FIELD, FORBIDDEN_SCOPE, IDEMPOTENCY_KEY_REUSED, and so on). That was done with a central error handler and an onSend hook that reshapes responses at the framework level, so it didn’t mean editing every route. GET /api/auth/me now returns the caller’s role, environment allowlist, and derived scopes, so a client can tell whether a call will be permitted before making it, instead of trying it and catching a 403.
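As an illustration of the single error shape, here is a small Python sketch of a central handler that maps any exception onto the envelope. It is an assumption-laden stand-in, not the Fastify handler itself: ErrorEnvelope, ApiError, and to_response are hypothetical names, the INTERNAL_ERROR code is invented for the fallback case, and the 400 status used in the usage note for VALIDATION_ERROR is a guess.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ErrorEnvelope:
    code: str
    message: str
    field: Optional[str] = None
    hint: Optional[str] = None
    requestId: Optional[str] = None

    def to_body(self):
        # Optional keys are omitted rather than sent as null.
        return {k: v for k, v in asdict(self).items() if v is not None}


class ApiError(Exception):
    """An error a route raises on purpose, with a documented code."""

    def __init__(self, status, code, message, field=None, hint=None):
        super().__init__(message)
        self.status = status
        self.code = code
        self.message = message
        self.field = field
        self.hint = hint


def to_response(exc, request_id=None):
    """Central handler: turn any exception into (status, body)."""
    if isinstance(exc, ApiError):
        env = ErrorEnvelope(exc.code, exc.message, exc.field, exc.hint, request_id)
        return exc.status, env.to_body()
    # Unknown failures keep the same shape and leak no internals.
    # INTERNAL_ERROR is a hypothetical code, not one the post documents.
    env = ErrorEnvelope("INTERNAL_ERROR", "Unexpected error", requestId=request_id)
    return 500, env.to_body()
```

Calling to_response(ApiError(400, "VALIDATION_ERROR", "name is required", field="name"), "req-1") yields a 400 and a body with code, message, field, and requestId. A client only ever has to parse one structure.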

Then the client. The Go client the CLI used internally was already a complete typed SDK, but it sat under internal/ with a module path that didn’t resolve, so nothing else could import it. 3.0 extracted it into a standalone module, github.com/bridgeinpt/bridgeport/client, released on the Go multi-module convention (client/vX.Y.Z). The rule attached to it matters more than the move: the client is treated as part of the API’s contract surface and bumped in the same pull request as any change to a wire shape, so a consumer never has to chase the API by hand.

Then the policy. There’s now a written stability and deprecation document. Breaking changes (removing or renaming fields, changing types or status codes, tightening validation) happen only in majors. Additive changes (new optional fields, new endpoints) happen in minors. A deprecated field is flagged deprecated: true in the spec and survives until at least the next major. The committed openapi.json is named as the canonical contract, and clients are told to pin against it.
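The policy's breaking-versus-additive split can be approximated mechanically. The Python below is a rough sketch under stated assumptions: classify_change is a hypothetical helper, it looks only at top-level properties and the required list of a JSON-Schema-like object, and it ignores status codes and nested changes, which a real compatibility check would need to cover.

```python
def classify_change(old, new):
    """Classify a schema edit as 'breaking', 'additive', or 'none'."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    removed = old_props.keys() - new_props.keys()  # covers renames too
    retyped = {
        name
        for name in old_props.keys() & new_props.keys()
        if old_props[name].get("type") != new_props[name].get("type")
    }
    # A field becoming required is treated as tightened validation.
    newly_required = set(new.get("required", [])) - set(old.get("required", []))

    if removed or retyped or newly_required:
        return "breaking"  # majors only
    if new_props.keys() - old_props.keys():
        return "additive"  # allowed in minors
    return "none"
```

A new optional field comes back as additive. Dropping a field, changing its type, or making it required comes back as breaking.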

The last piece was already there and became load-bearing: an Idempotency-Key header on mutating POSTs, Stripe-style. The same key within a 24-hour window replays the original response instead of running the work twice, and the same key sent with a different body returns a 422 with IDEMPOTENCY_KEY_REUSED.
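The replay rule is compact enough to sketch. This Python version is an in-memory illustration only. IdempotencyStore is a hypothetical class, the body is fingerprinted with SHA-256 over sorted JSON as an assumption, and the clock is passed in as a number of seconds so the window is easy to exercise.

```python
import hashlib
import json

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour replay window


class IdempotencyStore:
    def __init__(self):
        self._entries = {}  # key -> (body_hash, stored_at, response)

    @staticmethod
    def _hash(body):
        canonical = json.dumps(body, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def handle(self, key, body, handler, now):
        """Run handler(body) at most once per key within the window."""
        body_hash = self._hash(body)
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < WINDOW_SECONDS:
            stored_hash, _, response = entry
            if stored_hash != body_hash:
                # Same key, different payload: refuse rather than guess.
                return 422, {
                    "code": "IDEMPOTENCY_KEY_REUSED",
                    "message": "Idempotency-Key was already used with a different body",
                }
            return response  # replay the original response, do no work
        response = handler(body)
        self._entries[key] = (body_hash, now, response)
        return response
```

A retry with the same key and body gets the stored response back without running the handler again, which is what makes it safe for an automated client to resend after a timeout.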

The same pressure showed up lower in the stack. An API that gets driven by automation gets hit concurrently, and under concurrent writes SQLite’s single-writer lock could surface as an opaque 500 (SQLITE_BUSY and the stale-snapshot variant). The 3.0.1 patch added a Prisma client extension that retries those contended writes with jittered backoff, up to five times, and when retries run out it returns a 503 with Retry-After rather than a 500, which is a response a client can actually handle. The per-attempt busy timeout dropped from 5 seconds to 1 so a contended write fails fast into the async retry loop rather than blocking the event loop, and the behavior is tunable through DB_RETRY_* environment variables. In one repro run, eight writers hammering the same database produced 24 failed requests out of 384. A rerun after the fix, at twice the load, produced none. Those counts are a single developer repro, not a benchmark, but the behavior they show is pinned by a contention test that holds a real write lock and asserts a 503, never a 500. Idempotency makes a retry safe to send. This is what makes the failure worth retrying in the first place.
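The retry behavior, reduced to its control flow, looks roughly like the Python below. It is a sketch, not the Prisma extension: BusyError and with_retry are hypothetical names, "up to five times" is read as five retries after the first attempt, the backoff is exponential with full jitter by assumption, and the delay values and the Retry-After of one second are placeholders.

```python
import random
import time


class BusyError(Exception):
    """Stand-in for a contended write such as SQLITE_BUSY."""


def with_retry(write, max_retries=5, base_delay=0.05, max_delay=1.0,
               sleep=time.sleep, rng=random.random):
    """Run write(); on contention, back off with jitter and try again."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": 200, "body": write()}
        except BusyError:
            if attempt == max_retries:
                break  # out of retries, fall through to the 503
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: anywhere in [0, cap)
    # A 503 with Retry-After tells the client the failure is transient.
    return {"status": 503, "headers": {"Retry-After": "1"}}
```

The jitter matters because eight writers that all back off by the same amount collide again on the next attempt. The sleep and rng parameters exist only so a test can run the loop without waiting.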

Individually these are unremarkable. Together they’re the difference between an API you call and one you can build on without expecting it to move under you.

The Terraform provider

BridgePort sits between infrastructure that’s already provisioned with Terraform and the services running on top of it, which were configured by hand through the UI. That handoff was imperative: click through screens, run a script, hope it’s reproducible. The provider makes the BridgePort half declarative. Environments, servers, variables, secrets, config files, registry connections, container images, services, and their per-server deployments live in version control, and terraform plan shows configuration drift instead of letting it accumulate quietly. It’s built on terraform-plugin-framework and published to the Terraform and OpenTofu registries through goreleaser with GPG-signed assets, versioned on its own line because one provider release supports a range of platform versions.

The core design choice: configuration is declarative, runtime is not. You can declare a service and everything about how it’s configured. You can’t terraform apply a deploy. Deploys, restarts, and rollbacks stay operations you trigger directly, and runtime facts like health, live status, and exposed ports are read-only values the provider reports but doesn’t manage. BridgePort already kept an internal registry of which fields are runtime versus configuration, and the provider maps onto the same split rather than inventing its own.

Secrets get the same care they get everywhere else. A secret’s value is a write-only argument that never lands in Terraform state. You bump a version number to rotate it, and the plaintext stays in whatever source you pull it from:

resource "bridgeport_secret" "db_password" {
  environment      = "production"
  key              = "DB_PASSWORD"
  value_wo         = var.db_password  # write-only: never stored in state
  value_wo_version = "1"              # bump this to rotate
}

The token authenticates through a BRIDGEPORT_TOKEN environment variable for the same reason, so it stays out of config and state. Resources and data sources are addressed by their natural keys (environment plus name or key), which is also what terraform import works off, and plan runs offline, diffing against the configuration you submitted rather than calling the live API.
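The write-only pattern on the secret resource comes down to one rule: state records the version, never the value, and only a version change triggers a write. A Python sketch of that decision follows. The provider itself is written in Go on terraform-plugin-framework, so this is an illustration of the logic only, and plan_secret and the "create" and "rotate" labels are hypothetical.

```python
STATE_FIELDS = ("environment", "key", "value_wo_version")


def plan_secret(prior_state, config):
    """Return (api_call, new_state) for one secret.

    config carries the write-only value; new_state never does.
    """
    new_state = {name: config[name] for name in STATE_FIELDS}
    if prior_state is None:
        call = ("create", config["value_wo"])
    elif prior_state["value_wo_version"] != config["value_wo_version"]:
        call = ("rotate", config["value_wo"])
    else:
        # With nothing stored to compare against, a changed value is
        # invisible until the version is bumped.
        call = None
    return call, new_state
```

Because the value never reaches state, there is no stored copy to diff against. That is why rotation needs the explicit version bump.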

The provider builds no HTTP requests of its own. Every call goes through the same shared Go client the CLI uses, pinned at client/v0.4.0, which leaves the provider code to be Terraform schema plus plan-and-state plumbing and not much else.

What keeps a separate-repo provider honest is not the repo layout. It’s the contract enforced in CI. The provider’s acceptance tests run against the real server image: compose up a throwaway instance, mint an operator token from the first-boot admin bootstrap, run the suite with TF_ACC=1, tear it down. SQLite makes each instance cheap enough to do per-run. On the platform side, a provider-compatibility job builds the server image, checks out the provider at its latest release tag, and runs that acceptance suite against the image. A red suite fails the platform’s pull request. Either a change is compatible, or it ships with a documented deprecation and a provider follow-up filed before merge.

The MCP server

The MCP server exposes a curated part of the same API as tools an AI client can call: listing the services showing drift in production, rolling a service back to its previous image, summarizing recent health-check failures. The client’s model does the reasoning, and BridgePort runs the same deterministic operations it always has.

Mechanically it’s a Fastify plugin mounted at POST /mcp, registered only when MCP_ENABLED is set. The flag is parsed strictly and fails closed: only true or 1 turns it on, and when it’s off the route isn’t registered at all, so /mcp returns a 404 rather than an authenticated-but-empty endpoint. The transport is stateless Streamable HTTP from the official MCP SDK. Each request builds a fresh server for the authenticated caller and tears it down when the response closes, with no session store. Authentication reuses the same bearer token as the REST API, and the caller’s token is forwarded on every internal call a tool makes, so each tool effectively replays a real API request as that user. Role checks, scope checks, validation, idempotency, and audit logging all behave exactly as they would for the equivalent REST call. There’s no second permission model to keep in sync. The whole server is about 2,200 lines of new code, and wiring it into the existing auth and idempotency layers changed those by about a dozen lines.
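Fail-closed flag parsing is a one-liner, but the shape is worth seeing. In this Python sketch, mcp_enabled and register_routes are hypothetical names, and matching the strings exactly (so that TRUE or yes stay off) is an assumption about what "parsed strictly" means.

```python
def mcp_enabled(env):
    """Fail closed: only the exact strings 'true' or '1' enable MCP."""
    return env.get("MCP_ENABLED") in ("true", "1")


def register_routes(env):
    """Build the route table; /mcp exists only when explicitly enabled."""
    routes = ["GET /api/auth/me"]
    if mcp_enabled(env):
        routes.append("POST /mcp")
    return routes
```

Anything other than the two accepted values, including a typo or an empty string, leaves the route unregistered, so the default outcome of a misconfiguration is a 404.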

The decision that shaped the whole thing was what not to build. Most ways of adding AI to the project would have BridgePort calling a model itself, for log triage, or drift explained in prose, or a risk score on a deploy. Each of those means holding an API key, paying for inference, and sending logs, config, and topology out to a third-party model. For a self-hosted control plane that already stores SSH keys and secrets, that’s a meaningful amount of new surface and new data leaving the box. The MCP server avoids it by not running a model at all. It serves tools, and the model lives in the operator’s own client, on their own account. That trade is not free: it moves the model, the account, and the client config onto the operator, and it means BridgePort can’t offer log triage or drift-in-prose as built-in features, because there’s nothing on the box to generate them. Given what the box already holds, that is the right direction for the trade to run. The usual alternative for keeping data in-house is to self-host a model too, which works but adds a model to serve and a GPU to keep fed. And wiring an agent into an operational loop is a different problem from shipping a chatbot either way.

Connecting a client is a URL and a token:

{
  "mcpServers": {
    "bridgeport": {
      "url": "https://bridgeport.example.com/mcp",
      "headers": { "Authorization": "Bearer <token>" }
    }
  }
}

The safety model is mostly inherited and partly belt-and-suspenders. Read tools work with any valid token. Write tools (deploy, restart, rollback, run a backup) need a write scope, carry MCP destructive annotations so the client asks for confirmation, and derive an Idempotency-Key so identical calls within about a minute dedupe as retries instead of firing twice. On top of that, every tool output and resource read passes through a redactor before it leaves the server. Its denylist is generated from the Prisma schema (encrypted columns, their nonces, token hashes, raw SSH keys, agent tokens) and applied by key name, recursively, through nested objects. If a REST route ever regressed and started returning a raw database row, the redactor would still strip the secret-bearing fields, while keeping the presence-only flags like hasToken that the safe projections are meant to expose. The endpoint is off by default, has no UI toggle (enabling it is a deployment decision, not a database setting), and supports DNS-rebinding protection through an explicit MCP_ALLOWED_HOSTS list.
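The redactor's core is a recursive walk that drops denylisted keys wherever they appear. The Python sketch below assumes a hand-written denylist with made-up column names. The post says the real list is generated from the Prisma schema, and this code does not attempt that.

```python
# Hypothetical key names standing in for the generated denylist.
DENYLIST = frozenset(
    {"encryptedValue", "nonce", "tokenHash", "sshPrivateKey", "agentToken"}
)


def redact(value, denylist=DENYLIST):
    """Return a copy of value with denylisted keys removed at any depth."""
    if isinstance(value, dict):
        return {
            key: redact(inner, denylist)
            for key, inner in value.items()
            if key not in denylist
        }
    if isinstance(value, list):
        return [redact(item, denylist) for item in value]
    return value  # scalars pass through untouched
```

Matching on key name rather than on route or type is what makes it a backstop: a raw row leaking through any tool gets the same treatment, while a flag like hasToken passes because it is not on the list.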

Why 55 tools and not 251

The quick way to build either surface is to let the spec decide it. Point Speakeasy at an OpenAPI document and it emits one MCP tool per endpoint. Point HashiCorp’s generator at one and it scaffolds a Terraform resource per route. BridgePort generates the tool schemas from the spec but doesn’t let the spec decide the tool set. The API has around 251 routes. The MCP server exposes 55, 47 read and 8 write, and an admin page reads the live inventory out of the in-process registry so the exposed surface is auditable whether or not the server is currently enabled.

It’s the same reasoning as leaving deploys out of the Terraform provider. Every endpoint as a tool is a dump of what the API can do. A smaller set is a decision about what an agent should do with it. Exposing every endpoint is a known failure mode. The semantics matter more than the coverage, and 251 tools is also a context-window problem: the agent spends its attention reading a menu instead of planning, and plans worse for it. Curating down is not a smaller surface dressed up as a virtue. It is the optimization. Creating resources is left to the Terraform provider, the first version of the MCP server is scoped to observing and a safe set of operations, and the safe-write set grows from there.
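Generating schemas from the spec while curating the tool set by hand might look like the Python sketch below. All of it is illustrative: the operation ids, the allowlist, and build_tools are hypothetical, and the only borrowed names are readOnlyHint and destructiveHint, the tool annotation fields from the MCP specification.

```python
# Hypothetical slice of what a spec-derived operation index could hold.
SPEC_OPERATIONS = {
    "listServices": {"method": "GET", "schema": {"type": "object"}},
    "rollbackService": {"method": "POST", "schema": {"type": "object"}},
    "deleteEnvironment": {"method": "DELETE", "schema": {"type": "object"}},
}

# The curated part: which operations become tools, and how they are marked.
ALLOWLIST = {"listServices": "read", "rollbackService": "write"}


def build_tools(operations, allowlist):
    """Emit tool definitions only for allowlisted operations."""
    tools = []
    for op_id, access in allowlist.items():
        op = operations[op_id]  # KeyError if the allowlist drifts from the spec
        tools.append({
            "name": op_id,
            "inputSchema": op["schema"],  # schema comes from the spec
            "annotations": {
                "readOnlyHint": access == "read",
                "destructiveHint": access == "write",
            },
        })
    return tools
```

The spec supplies the shape of each tool and the allowlist supplies the decision. A route added to the API does not become a tool until someone adds it to the list on purpose.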

The rest of 3.0

Not everything in 3.0 was about the contract. The web UI was rebuilt on a standard component library, the front-end bundle was split into lazy-loaded chunks, and a configuration audit surfaced a batch of settings that had been hardcoded and removed a few the app had been silently ignoring. Useful, unglamorous, the kind of work a major version is mostly made of.

Backups got the change that was overdue. Retention used to be flat: keep everything for N days, then delete it. That forces a bad trade, because 90 days of nightly backups is 90 files, and 7 days leaves no medium-term recovery point at all. 3.0 replaced it with grandfather-father-son rotation, the scheme borg, restic, and Time Machine have used for years: keep the last few, then thin older backups down to daily, weekly, monthly, and yearly tiers. Keeping 7 daily, 4 weekly, and 6 monthly backups spans half a year in roughly 17 files instead of 180. It ships as presets (lean, balanced, long-term) with a global default each database can override. Manual and pinned backups are exempt, a floor guarantees it never deletes down to nothing, and it prunes nothing at all until you opt in. This is table stakes for anything storing real data, and now it’s in the box rather than in a cron job someone wrote once and forgot.
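The arithmetic behind the tiers can be checked with a short Python sketch. It is one plausible reading of grandfather-father-son, not BridgePort's implementation: gfs_keep is a hypothetical function, each tier keeps the newest backup per day, ISO week, or calendar month, tiers are allowed to overlap, and the pinned-backup exemption and the never-delete-everything floor are left out.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, daily=7, weekly=4, monthly=6):
    """Return the set of backup dates a GFS policy would keep."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])  # the most recent dailies

    def keep_newest_per_bucket(bucket_of, limit):
        seen = []
        for day in newest_first:
            bucket = bucket_of(day)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == limit:
                break  # tier is full; everything older is out of range
            seen.append(bucket)
            keep.add(day)

    keep_newest_per_bucket(lambda d: tuple(d.isocalendar()[:2]), weekly)
    keep_newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep


# 180 nightly backups ending on 28 June 2026.
end = date(2026, 6, 28)
nightly = [end - timedelta(days=n) for n in range(180)]
kept = gfs_keep(nightly)
```

The ceiling is 7 + 4 + 6 = 17 files. Under this overlapping reading the newest backup counts toward all three tiers, so the sample keeps 15, reaching back to the end of January. Schemes that count tiers without overlap land on the full 17.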

The documentation went live with the release, at bridgeport.bridgein.com: installation and getting-started, a guide per subsystem, a Terraform guide and an MCP reference for the two surfaces this piece is about, the API stability policy, and a full API reference generated from the same OpenAPI spec the release hardened. The contract ended up documenting itself.

Once the API is a stable dependency, a new surface is mostly a curation problem, not an integration one. The CLI, the Go SDK, the Terraform provider, and the MCP server are all clients of the same definition, and not one of them had to rebuild validation, permissions, or audit. They inherited it. It’s the same reason the layer around a model usually matters more than the model. The features are the visible part. The contract is the part that took the time.

The Moat That Walks Out the Door

joaofogoncalves — Sun, 14 Jun 2026 00:00:00 GMT

Part two ended somewhere uncomfortable, though it read like a win. The moat is not the harness, and it is not the skill files either. Those depreciate and those copy. What is left, the part a competitor cannot lift out of your repo, is the judgment of the people who were there when the system broke. It lives in the team, not the repo.

True. It is also the moat a company can least protect.

Walk back through what the series has been calling durable. The model is rented, and it improves on someone else’s schedule. The harness is owned, and it rots on yours, shedding a gate every time a model release makes one redundant. Both of those change slowly and predictably. The moat we landed on does neither. It does not depreciate on a roadmap or get productized by a vendor. It resigns. It takes a better offer. It burns out in Q3 and goes quiet until spring.

The one thing you cannot rent and cannot buy turns out to be the one that can leave on its own.

So the question part two left open is the one that actually keeps you up. How do you own a moat that has legs?

Write it down, then

There is an obvious answer, and a whole industry sells it. Capture the tribal knowledge. Kill the bus factor. A runbook for every gate, a postmortem for every incident, a wiki page for every decision, and the knowledge stops living in one person’s head.

For most of what a team knows, that is right. Write it down. The architecture, the runbook steps, the config, the timeline of what actually happened: all of it belongs on a page, and a team that skips that work is not guarding a moat, it is just undocumented.

But the series has already walked us into the problem with it. The thing we are trying to preserve here is not a fact. Facts you write down and you are done. This is judgment, and the most you can write down about judgment is the answer it reached last time.

Write a piece of it down completely enough to hand over, and what you have captured is a snapshot: what was true against one model, one architecture, one quarter’s failure surface. The harness depreciates, and a frozen record of why you built it depreciates faster, because it cannot tell you which of its reasons still hold. The week a model ships, the page explaining last year’s gate is worse than no page. It reads as current. It is not.

So far this is the case against the wiki, and every engineer nodding along has already made it. But notice what the case actually shows. The problem is not that judgment cannot be written down. Plenty of it can. The problem is that what you write down is last release’s answer, and the model does not hold still. Part one already named the real asset and part two sharpened it: the moat was never the harness you have, it is how fast you rebuild it when the model moves. The moat is a rate.

A rate does not live on a page, and it does not survive in a single head either. It fails two ways, and they are not the same problem with the same fix. Left unused, it slows. The senior who had it, then moved off the surface and stopped re-deriving anything, is back to reading last year’s answers like everyone else. And lodged in one person, it is a rate with a notice period, which is the failure the title is about.

So the question part two left open splits in two. You keep the rate exercised so it does not slow. You spread it so it cannot walk out. First, though, why neither a page nor the model can hold it for you. Then the two fixes, and the reason a team makes neither until it is too late.

What a runbook can’t hold

This is not an argument against writing things down. It is an argument about what writing things down can hold.

Michael Polanyi named it sixty years ago, in a line that has outlived most of what surrounded it: we can know more than we can tell. A diagnostician cannot fully explain the read that took thirty years to build. A senior engineer cannot fully explain why a passing test on the billing path still makes them reach for a second look. They can give you the rule. They cannot give you the thousand cases that taught them when the rule does not apply. The economist David Autor gave the idea a name for the automation age, Polanyi’s Paradox: the work hardest to hand to a machine is the work whose rules we cannot fully state, because we never held them as rules.

Documentation catches the what. The skill file, the gate config, the runbook step, the postmortem timeline. All of it real, all of it worth having. What it cannot catch is the rate: the speed at which someone reads a new release and re-decides which of those entries still hold, which gate was a fluke and which was the model telling you something, which guard has quietly become latency and which is still load-bearing.

Make it concrete. A model ships and you audit the gates. The runbook for the billing path was updated last release, so it is not stale: this gate routes billing diffs to a human because the model could not grade them safely. Then the new model lands and the billing evals come back greener than they have ever been. Now the page is a question, not an answer. Does the gate still earn its place, or did the release just pay it off? The page cannot tell you, because it records what was decided, not how to decide it again. Answering means re-running the reasoning against the new model: was this gate about the model’s weakness, which a better model voids, or about the cost of being wrong on billing being asymmetric, which no release touches? That distinction is not hard to state. It is hard to redraw every release, fast, before you either cut a gate that still matters or keep one that has quietly become latency. The asset is not the sentence in the wiki. It is the speed of drawing the line again.

Your ledger of incidents records the answer to the last failure. It cannot record the instinct for the next one, the failure nobody has seen yet, the one that matches no entry in the book.

This is where the AI-era objection lands, and it deserves a straight answer. The model can read the whole ledger in a way no engineer can, and a frontier model does more than recite it back: it generalizes past the entries. Autor named exactly this in the paper that named the paradox: modern machine learning is the project of overcoming Polanyi’s Paradox, inferring from enough examples the rules we apply but cannot state. So won’t the model just learn the judgment and hand it back?

Use it. It will help you re-derive faster, and you should let it. But the same model lands in your competitor’s repo the same week, improving on a schedule neither of you sets. What it gives you both is a better engine. What it cannot give either of you is the rate at which your team turns that engine on your own novel failures and redraws your own lines before a customer finds them. The ledger and the model are inputs to that rate. They are not the rate. Autor, who named the paradox as the thing machine learning was trying to dissolve, looked straight at that project and concluded the work demanding judgment was the part that held. The reason is narrow and it is ours: the next failure matches no entry yet, so meeting it is not retrieval, it is re-derivation, done in the moment against a model that just moved under you.

So the asset is not the instinct sitting in someone’s head, which is just another snapshot. It is the rate at which the people working the surface keep redrawing the line as the model moves. They are not a documentation gap waiting to be closed. They are the moat, in the only form it takes: judgment in motion, fast enough to keep up.

Onboard into the incident

A rate is not a fact you can hand someone. It is a skill, and you build a skill the way this one got built in the first place. Through exposure to failure.

That sounds soft. It is the most concrete thing in this piece.

When the next incident hits, the move most teams make is to send in whoever can fix it fastest. Of course they do. The system is down. But the fix is the cheap part, and the person doing it already has the rate. The expensive, durable thing is the reps, and most teams let them evaporate. Put someone newer on the call. Not to watch, and not to take dictation. To hold the pen and make some of the calls, while the one who is fast says why this and not that and what to look at next. They are not memorizing your answer; that would be the snapshot again. They are building their own speed at finding it. The outage is the curriculum. You will not build one this good on purpose.

The same logic runs through the re-derivation itself. Auditing the harness when a new model lands, deciding which gates to cut and which to keep, is the work that sets the rate. It is also usually done fast, alone, by the one person who can. That is how you get a bus factor of one on the exact capability the series called the moat. Rotate the pen. Make the audit something two people do together, and change who leads it each release. It is slower the first few times. It is how the rate ends up in more than one head.

There is a problem with learning from incidents, and a skeptic names it fast: on a healthy system they are rare. You cannot keep a second person sharp by waiting for the next high-severity failure on the expensive path, because if you are running well, one may not arrive for a quarter. So you stop waiting. The ledger changes job. Stored as a record, it is a filing cabinet nobody opens. Run as a drill, it is the closest thing you have to a flight simulator. Re-run an old incident with someone who was not there, let them make the calls, and only then show them where the real one went. Pilots rehearse the engine failure they will probably never see, on purpose, on the simulator. The entry was paid for once. Nothing says you can only spend it once.

None of this scales to every engineer and every incident, and it is not meant to. Most tickets are routine, and routine is what the runbook is for. The high-touch version is reserved for the part that earns it: the non-deterministic core, the harness audit the week a model lands, the high-severity incident on the path where being wrong is expensive. That is a thin slice of the work, and the slice that is the moat. The point is not to slow everything down so juniors can watch. It is to stop spending the rarest work in the building on an audience of one.

Hire for the slope

This is where the argument cashes out, because hiring is the one decision that sets your rate for years, and it cuts against how most teams write the job description.

You cannot hire someone who already holds your judgment. Nobody has it. The incidents that built it happened in your production, against your data, with your customers finding the failure modes only your product has. The most experienced engineer on the market arrives with deep judgment about systems that are not yours. That is worth a lot. It is not the same thing.

So the trait that matters is not how much a candidate already knows about harnesses. It is how fast they can take on an incident they never lived through and start producing judgment of their own. Absorption rate, not inventory.

This is the same direction the filter has been moving for a while now. As models close the gap on syntax, the thing that separates engineers is system-wide reasoning, the ability to hold the whole system in their head and reason about where it breaks. You are not hiring for the code anymore. You are hiring for the slope: how quickly someone goes from not having your context to having it. That slope is what turns a new hire into part of the moat instead of a standing drain on the people who already are.

The trouble is that slope does not show up on a résumé, and most interview loops are built to measure inventory. You can measure it directly, but not by asking a candidate to guess at a system they have never seen. Give them enough to reason with: a redacted incident writeup from your own history and the slice of the current harness around it, the gates, the retries, what routes to a human and what does not. Then ask what they would look at first, which gate they would re-examine after a model change, and the single question they would put to the on-call to decide whether it still holds. You are not scoring the answer. It is your system and they cannot know it. You are watching how they build the question: what they reach for, which failure modes they suspect, whether they can reason about where this breaks without having lived in it.

What you are listening for is specific. Do they surface the unstated assumption behind the old gate, the business fact it was really protecting, in the first few minutes, or do they argue from the eval numbers in front of them? Do they reach for the past incident least like the current symptom and rule it out, or pattern-match to the nearest one? And watch the confound: a candidate who has run a system like yours will look fast because they are recognizing their old one, not re-deriving yours. Probe a part deliberately unlike anything on their résumé, where speed has to come from reasoning and not memory.

The proof comes in the first ninety days, which is also how you learn whether your interview read slope or fooled itself. Not how much they shipped. How soon they could lead a harness audit without a chaperone, take the pen on a real incident, and have it go well.

The engineer who can absorb your ledger in a month and the one who needs a year are not the same hire, even with identical résumés. One compounds the moat. The other borrows against it.

Nobody is paid to do this

Everything above is correct and almost no one does it, and that is not because managers are careless. It is because every one of these moves bills the wrong account at the worst possible time.

Putting a newer engineer on the incident slows the fix while the system is down. Rotating the audit hands it, half the time, to the slower of the two people who could run it. Re-running an old case spends senior hours on a problem that is already solved. Each cost is immediate, visible, and lands this quarter. The payoff is a bus factor you do not need until someone leaves, which is exactly the benefit a quarterly review cannot see. Sending the best person in alone to fix it fast is not a lapse. It is what optimizing for this week returns, every week, and a bus factor of one is what it compounds into.

There is a second cost, quieter, and it explains why this cannot be solved by decree. The person who holds the moat knows it is a moat. Asking them to spread it is asking them to make themselves more replaceable, deliberately, for an institution that does not always return that kind of favor. That only happens when the spreading is seen and valued as the senior work it is, not booked as overhead between real tasks. A moat moves to a second head when the first head has a reason to let it.

The moat has legs

So walk the whole series back out.

The model is rented. It gets better on someone else’s schedule, and you take the upgrade when it lands. The harness is owned. It rots on yours, and the work is keeping it from rotting. And the moat underneath both of them, the judgment that decides which gate to cut and which to keep, lives in people. Which means it has legs. It can quit.

You do not protect it by writing it down, though you should write down what you can. The page holds last release’s answer and the model holds everyone’s; neither holds the rate at which you redraw the line when the ground moves. You protect a rate the two ways any practice is protected: you keep it exercised so it does not slow, and you spread it so it does not live on one person’s calendar. That means treating every incident as a curriculum and not only a fix, every model release as a drill and not only a sprint, every hire as a question of how fast someone reaches your speed. None of it is free, and all of it loses to this quarter unless someone decides it will not. The moat does not last because you locked it in a document, or because one person guards it. It lasts because more than one person is fast, and because the next one is getting faster.

A better offer takes a person. It takes the moat only if you let the moat ride out the door in one head. The harness, you can lose to a better model. The judgment, you can only lose to a worse manager.

Rent the Loop, Build the Moat

joaofogoncalves — Thu, 11 Jun 2026 00:00:00 GMT

Part one ended on a line that is easy to nod at and hard to act on: the moat is not the harness you have, it is how fast you can rebuild it when the model moves. True. Also useless on a Monday.

Nobody ships “how fast you can rebuild it.” They ship a decision about what to write themselves and what to pull off a shelf, then they live with that decision for a year. So this is the part the argument skipped. Where the line goes.

Draw the line

The build-versus-buy question got loud this year. There is a small industry of decision frameworks for it now, most shaped the same way: buy a platform for ninety percent of cases, build only when the agent’s logic is a real moat. The advice is not wrong. The cut is.

It treats the agent as one object you either purchase or assemble. That object does not exist. What exists is a loop with a harness around it, and the two belong on opposite sides of the line.

The loop is generic. A model decides the next step, a tool runs, the result comes back, the model decides again. Around that: a sandbox to run in, a session log to remember what happened. None of it is specific to you, and the people who build the model will now run all of it for you. The Claude Agent SDK lets you run the same loop that powers Claude Code inside your own process. Managed Agents runs the loop, the sandbox, and the session log on Anthropic’s machines for eight cents a session-hour, with tokens billed on top. Their own framing for it is decoupling the brain from the hands. Rent the hands. They are a commodity, and they are someone else’s problem now.

There is a clean test for which side of the line a thing sits on. If it is knowable from outside your company, rent it. If you only learned it by being in your own production when something broke, build it. The loop is knowable from outside. Your deploy gates are not. The permission boundary drawn around your blast radius is not. The verification that knows what correct means in your product, on your data, against the way your customers actually use it, is not. No SDK ships with any of that, because none of it is visible from where the SDK was written.

Most of what an agent system costs over its life is not the build. It is the maintenance, and most of the maintenance is chasing the model as it shifts under you. The frameworks file that under cost and warn you about it. Part one filed the same fact under moat. The maintenance is not the tax you pay for owning the harness. The maintenance is the thing you own.

Rent the loop. Build the harness. The line runs exactly where your domain starts.

The evaluator you can’t rent

If you build one thing, build the part that decides whether the work is good.

It is the part everyone underbuilds, because the lazy version is one line: ask the model if the work is good. The model says yes. Anthropic watched this in their own harness and named it without flinching. Models “tend to respond by confidently praising the work, even when the quality is obviously mediocre.” The thing grading the work is the thing that made it, and it is agreeable by construction.

So the evaluator has to be separate, and it has to grade like a compiler instead of a manager. A compiler does not care how confident the submission is. It runs the check and returns pass or fail. The distinction that makes it real: the evaluator interacts with the running system instead of reading the diff. It clicks the button, hits the endpoint, reads the actual error. A diff that looks correct and a feature that works are different claims, and only one of them ships. The whole loop rests on that: a failure you can detect is the only kind the system recovers from on its own.
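A minimal sketch of an evaluator that grades like a compiler, in Python. The `Submission` and `Check` types and the `evaluate` function are hypothetical names for illustration. Each probe is assumed to exercise the running system (hit an endpoint, click a button) and report success. The model's own opinion of its work is carried along but never read.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Submission:
    diff: str
    self_assessment: str  # what the generating model says about its own work


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # exercises the running system, True on success


def evaluate(submission: Submission, checks: list[Check]) -> tuple[bool, list[str]]:
    """Return (passed, failed_check_names). The self-assessment is never read."""
    failures = []
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            # A probe that crashes counts as a failure, not as "unknown".
            ok = False
        if not ok:
            failures.append(check.name)
    return (not failures, failures)
```

The list of failed check names is what goes back to the agent, so the next attempt starts from the actual error and not from a summary of it.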

In practice this is the least glamorous code in the system and the part I trust most. My agents fail their first CI run roughly seven times in ten, read the red build, and fix themselves before a human looks. That thirty-percent first-pass rate is not a number I am proud of, and it is not supposed to be. It is the evaluator earning its keep. The number that would worry me is not the seven that fail; it is any of the three that pass and shouldn’t have. A gate is only worth trusting if its green means green — the failures are cheap, the false pass is the one that reaches a customer. The gate catches the confident, plausible, wrong output and sends it back, and the loop closes in tokens instead of in a stand-up.
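The two numbers in that paragraph are different measurements, and a short Python sketch keeps them apart. The `Run` record and its `defect_found_later` field are assumptions for illustration. In practice that field would have to come from tracing a production bug back to the change that introduced it.

```python
from dataclasses import dataclass


@dataclass
class Run:
    first_ci_passed: bool
    defect_found_later: bool  # hypothetical: a bug traced to this change after merge


def gate_report(runs: list[Run]) -> dict[str, float]:
    """First-pass rate is cheap to watch; false-pass rate is the one that matters."""
    total = len(runs)
    first_pass = [r for r in runs if r.first_ci_passed]
    false_pass = [r for r in first_pass if r.defect_found_later]
    return {
        "first_pass_rate": len(first_pass) / total if total else 0.0,
        "false_pass_rate": len(false_pass) / len(first_pass) if first_pass else 0.0,
    }
```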

The gate that matters most is the one that refuses to grade itself at all. A green diff that touches the billing path does not merge on a passing test. It routes to a human. Not because the model is dumb. Because the cost of being wrong there is asymmetric, and the asymmetry is a fact about my business that no model knows. That rule is a few lines of config and it is pure harness. It will be true on the next model, and the one after that.
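A minimal sketch of that rule in Python, assuming the gate sees the list of changed paths and the CI result. The path patterns, the `route_merge` name, and the three outcomes are hypothetical, not the configuration the system actually runs.

```python
from fnmatch import fnmatch

# Hypothetical patterns for paths where a wrong merge is asymmetrically expensive.
HUMAN_REVIEW_PATTERNS = ["billing/*", "*/billing/*", "payments/*"]


def route_merge(changed_paths: list[str], ci_passed: bool) -> str:
    """Decide what happens to a diff: 'reject', 'human_review', or 'auto_merge'."""
    if not ci_passed:
        # A red build goes back to the agent; no human needs to look yet.
        return "reject"
    touches_sensitive = any(
        fnmatch(path, pattern)
        for path in changed_paths
        for pattern in HUMAN_REVIEW_PATTERNS
    )
    # Green is necessary everywhere, but on these paths it is not sufficient.
    return "human_review" if touches_sensitive else "auto_merge"
```

Nothing in the function mentions a model, which is why a rule like this carries over to the next release unchanged.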

Skills are a ledger of failures

People ask about the twenty-seven custom skills the way they ask about prompts. What’s in them. What’s the trick.

There is no trick. Each skill is an incident.

The system did something wrong once, in a specific way, on a specific kind of work, and instead of writing a cleverer prompt, someone wrote down what it should have known and made that knowledge load-bearing. The skill for resolving merge conflicts exists because two agents corrupted each other’s work. The skill that triages a Sentry exception before filing anything exists because the system once opened a handful of issues for one root cause. The agent roster grew the same way: every specialist is a generalist that kept making one class of mistake until the fix earned its own definition.

This is the part that is genuinely yours, and it is yours for a reason that has nothing to do with secrecy. A competitor can read every skill file in the repo. They cannot read the outage that taught you to write it. The file is the answer. What you own is the question it answers, and you only got the question by being in production when it broke.

The skills are a ledger. Every entry was paid for once.

The week a model ships

Here is the cadence the benchmark ritual replaces with a single API call.

A new model drops. The cheap move is the one most teams make: swap the string, run the eval, ship. The move that compounds takes about a week and looks like maintenance, which is exactly why it gets skipped.

You read your own harness as a list of bets. Every gate, every retry, every reset is a written-down assumption about something the model could not do on its own — and as Anthropic puts it, those assumptions “go stale as models improve.” A better model just paid out some of those bets and voided others. Their own harness is the cleanest example on the public record: the context resets that Sonnet 4.5 needed became dead weight on Opus 4.5, so they dropped them. The harness got smaller because the model got better, and someone had to notice.

So you audit. Some gates die because the model got better; some die because your own tooling did, and it pays to know which.

The first kind is a real bet on the model’s limits. My agent loop used to stop and ask me to approve each step, because I did not trust it to run unattended. A stronger model earned the trust, the per-step prompts came out, and the loop went autonomous. That was the model paying out a bet I had made against it.

The second kind was never about the model at all. My coordinator used to hand work to its sub-agents through files on disk — a spec file, an issue file, a review-output file — because that was how state moved between them. Then a small tooling change let each agent read the issue and post its own result directly, and the files were gone. No model release did that. It was scaffolding I had mistaken for structure.

Only the first kind compounds, which is why the audit runs as a routine and not a cleanup. Deleting a dead gate is the work most teams skip, because a gate that no longer does anything still reads as safety.
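One way to make the audit a routine is to keep the bets in a list you can iterate over. The Python sketch below assumes each gate is recorded with the assumption it encodes and the kind of bet it was. The `Gate` fields and the `audit` function are hypothetical, and `still_needed` stands in for the human judgment reached by re-testing the gate against the new model.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    assumption: str     # what the gate assumes cannot be trusted
    kind: str           # e.g. "model_limit", "tooling", or "domain"
    still_needed: bool  # outcome of re-testing after the release


def audit(gates: list[Gate]) -> dict[str, list[str]]:
    """Sort gates into keep or delete, recording why each deleted gate died."""
    report: dict[str, list[str]] = {"keep": []}
    for gate in gates:
        if gate.still_needed:
            report["keep"].append(gate.name)
        else:
            report.setdefault(f"delete_{gate.kind}", []).append(gate.name)
    return report
```

Run against the two examples above plus the billing rule, the per-step approval lands under `delete_model_limit`, the file handoff under `delete_tooling`, and the billing review under `keep`.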

Then you go looking for the new failure surface, because there always is one. A model that writes better code fails in subtler ways. It is confidently wrong about more sophisticated things. The gate that caught last year’s mistakes will not see this year’s. Finding the new edge before your customers do is not panic; it is a routine you run every release, and each time you run it you pull a little further ahead of the team relearning it from their first outage.

None of this is free, and it is not for every team. Running the audit every release is a standing tax, and a smaller shop will pin the model for six months and eat the staleness instead — often the right call. But staleness is not static. The model keeps moving whether you audit or not, the un-audited harness keeps rotting, and the distance between the team that re-derives and the team that waits is set by a release cadence neither of them controls.

The line keeps moving

There is an obvious objection to all of this. The labs are already selling the part I called un-rentable. Managed Agents productized the loop, the sandbox, and the session log a year after everyone started writing them by hand, and the line between rent and build does not hold still. It moves outward every release, and verification on your own data is the next thing they reach for.

So why bet on the evaluator surviving? Not because a vendor can’t build an evaluation framework. They will, and you should use it the day it ships. The reason is narrower and harder to copy: what makes your evaluator correct is the ledger of incidents behind it, and that ledger is written in an ink the vendor cannot read. Your outages. Your data. The way your customers actually break things. They can ship the frame. They cannot ship the contents.

It lives in the team, not the repo

So what is the asset, exactly.

Not the harness. Part one already conceded that one: it depreciates, the model eats pieces of it, the leaner version next quarter is the better one. Not the skills files either; those copy too.

The asset is the judgment that decides which gate to delete and which to keep. That judgment does not live in the repo. It lives in the people who were there for the incidents, who know which failures were flukes and which were the model telling them something, who can read a new release and re-derive the harness in a week instead of relearning it over a quarter of outages.

You could hand a competitor the entire system and they would still be behind, because what you built was never the code. It is a team that knows where this specific system breaks, and how fast it stops breaking that way.

The model is rented, and it gets better on someone else’s schedule. The harness is owned, and it rots on yours. The work is keeping it from rotting. The people who can do that work the week the model moves are the only part of this that was ever a moat.

The Harness Is the Moat

joaofogoncalves — Mon, 08 Jun 2026 00:00:00 GMT

There is a ritual that runs in most engineering orgs right now. A new model drops. Someone reads the benchmark card, runs it against an internal eval, posts the delta in a channel, and files a ticket to swap the API string. The model is treated as the variable. Turn the knob, get more capability, ship better agents.

The teams actually running agents in production barely touch that knob.

They are not indifferent to the model. They will take the better one when it lands. But they know something the benchmark ritual hides: the model was never the thing standing between them and a working agent. The thing standing between them was everything around the model. The state coordination, the permission boundaries, the failure recovery, the verification gates, the deploy path. The unglamorous layer that turns one good completion into a system you can leave running overnight.

That layer has a name. Anthropic calls it the harness, and they put it in the title of an engineering post about long-running agents. The name is worth keeping, because once you have it, the part of the work that actually decides whether an agent ships comes into focus.

The model is the commodity. The harness is the moat. Both halves of that sentence need defending, and the second half is more slippery than it looks.

What a harness actually is

A harness is not a prompt and it is not a framework you install. It is the running system that lets a model do useful work without a human holding its hand through every step.

Concretely, in the system I run, the harness does six things. It coordinates state across parallel working directories, so four agents can edit the same repo at once without writing over each other. It handles permissions and isolation, so an agent that goes off the rails can’t touch production credentials or delete a branch it shouldn’t. It recovers from CI failures: it reads the red build, diagnoses the cause, and pushes a fix without waking anyone. It gates output behind verification, so nothing merges that hasn’t passed a check the model didn’t get to grade for itself. It recovers from partial failures, picking up a feature that died halfway through, three agents ago. And at the end, it deploys.

The model is one call inside that loop. An important call. Not the loop.

This is the part the benchmark card cannot show you, because the benchmark measures the call and the harness is everything between the calls. Take the biggest model jump on the board right now. Opus 4.8 landed in late May about ten points clear of the field on SWE-bench Pro, one of the hardest coding benchmarks in current use. That is a real gain, the kind worth swapping the API string for. It also does not coordinate worktrees. It does not know your deploy gates. It does not remember that the last agent left the migration half-applied. The model improves along one axis. Your agents fail along a different one. The two barely intersect.

That orthogonality is the whole argument.

The model improves. The breakage doesn’t move.

Look at where the frontier actually sits. The benchmarks the industry leaned on two years ago are saturated. MMLU clusters in the low 90s across every serious model, close enough that the gaps read as noise. On the chat arena leaderboards the top models sit within a point or two. The newer, harder benchmarks still spread out: Opus 4.8 opened a real lead on SWE-bench Pro this spring over GPT-5.5 and Gemini 3.1 Pro. But look at what that lead buys you. A model that codes better on its own still does not coordinate worktrees, still does not know your deploy gates, still does not remember the half-applied migration. The gain lands on an axis that does not touch the one where your agents actually fail.

Now look at where production breaks. The failures are specific, and almost none of them are the model being dumb. I have spent real time diagnosing harness failures in Claude Code itself, the tool I build on. A git worktree that follows its .git pointer back to the main repo and registers every slash command twice, so the agent sees a duplicated menu and picks the wrong one (issue #26992, closed as not-planned). An agent team that crashes when it hits a permission boundary mid-run instead of degrading gracefully. State tracking that loses the thread across worktrees.

It is worth being precise about what those are. Claude Code is itself a harness, one I rent from Anthropic, and those are its bugs, not the model’s. A harness has layers: the loop the vendor ships, and the layer I build on top of it. Failures show up at both. The point survives the distinction. None of them are the model. The model was fine. What broke was the scaffolding, and the scaffolding is where the work lives.

The data says the same thing at scale. By some estimates around 80% of AI projects fail to deliver value, and when you read the postmortems the cause is rarely that the model wasn’t smart enough. The project was abandoned before production, or completed but worthless, or unable to justify its cost. Practitioners writing honestly about agents converge on the same diagnosis: they fail in production because orchestration got treated as an afterthought, and teams end up rebuilding session state, memory, and tool routing every couple of months as the model shifts under them.

That last detail is the one to hold onto. The harness is not a fixed asset you finish and own. It moves under you. Which reads like an argument against everything I just said, until you follow it one step further.

Grade the output like a compiler, not an employee

The hardest part of a harness is the part people skip, because it is the least fun to build and the easiest to fake.

Verification.

Here is the trap. You ask the model whether the work is good, and the model tells you it is good. Anthropic ran straight into this building their own harness and named it plainly: models “tend to respond by confidently praising the work, even when the quality is obviously mediocre.” Self-assessment does not work, because the thing doing the assessing is the thing being assessed, and it is agreeable by construction.

So the harness cannot trust the model’s self-report. It has to grade the output the way a compiler grades code, not the way a manager grades an employee. A compiler does not care how confident the submission is. It runs the check and returns pass or fail. The harness needs a separate evaluator that interacts with the running system, clicks the button, reads the actual error, and fails the work when the work is wrong, regardless of how good the diff looked. I’ve written before about why detectable failure is the assumption the whole agent loop rests on: the patterns that scaled human teams transfer cleanly to agents, except the one that quietly assumed a human would notice when something broke.

The economics of getting this right are not subtle. In Anthropic’s own comparison, a single model run cost nine dollars and produced a broken application. The multi-agent harness, with a separate planner, generator, and evaluator, cost two hundred dollars and produced one that worked. Roughly twenty-two times the spend, and it was the cheap option, because a broken app costs more than two hundred dollars to discover in production.

That is the discipline in a sentence. You are not building around what the model does well. You are building around what it does badly, and every gate you add is a place you decided not to trust it.

The hard part was never the prompt

Let me put my own receipts on the table, because the argument is cheap without them.

At BRIDGE IN I run a fourteen-agent orchestration system that ships full-stack features end to end. It plans, implements, tests, and delivers. It triages Sentry exceptions and opens its own issues when it finds a real one. It recovers from CI failures on its own, and that one is not a flourish: it fails its first CI run about seven times in ten, then reads the red build and fixes itself before a human looks. The thirty-percent first-pass rate is the least flattering number I track and the one I point to first, because it is the recovery harness doing precisely the job the model cannot do for itself. It merges its own pull requests, most months with no human commits on the branch. It deploys to production. The whole thing runs through twenty-seven custom skills I wrote, each one a small contract for a specific job the system needs done reliably.

People ask what prompts I use. It is the wrong question. The prompts took an afternoon. The system took months, and almost none of those months went into prompting.

They went into the harness. Into figuring out how three agents share a repo without corrupting each other’s work. Into deciding what an agent is allowed to do without a human in the loop and what it is never allowed to do. Into the verification gates that catch the confident, plausible, wrong output before it reaches a customer. Into the recovery paths for when an agent dies mid-feature and the next one has to figure out where it was. The orchestration problem was the actual hard part long before the model was good enough to make it worth solving.

This is also why “what’s the best prompt” is the wrong frame entirely. The skill that matters is designing the system the model operates inside. The prompt is a line in a config file. The system around it is the part that compounds.

Won’t the model just eat the harness?

Here is the strongest objection, and it comes from the same Anthropic post I keep quoting. Every component of a harness, they write, “encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve.” Read the second half slowly. A harness is, in part, a list of the current model’s weaknesses written down in code. Models get better. Some of those entries expire on the next release.

This is not hypothetical, and they show their own work on it. They built context resets into their harness because Sonnet 4.5 got anxious as its context filled up. Opus 4.5 mostly stopped doing that, so they deleted the resets entirely. The scaffolding that was load-bearing in one version was dead weight in the next. The same post has a sequel to the two-hundred-dollar run, too: on the newer model they built a full audio workstation for a hundred and twenty-four dollars, sustaining hours of coherent work after stripping out the structure the older model had needed. The model improved and the harness shrank to meet it.

It is not only the model side. The labs are now selling the harness directly. Anthropic shipped Managed Agents in April, a hosted service that runs the agent loop, the sandbox, and the session log on their infrastructure for eight cents a session-hour. The generic orchestration layer, the part everyone was writing from scratch a year ago, is becoming something you rent.

So if the components depreciate and the plumbing commoditizes, where is the moat?

Not in any single gate. In two things the next model release cannot touch and the hosted runtime does not ship.

The first is the part of the harness that encodes your domain instead of the model’s limits. Your deploy gates. Your permission boundaries. The verification that knows what correct means in your product, on your data, against the way your customers actually use it. Managed Agents will run the loop. It will not know that a green diff touching the billing path still needs a human, or that your migrations have a failure mode the last three agents already hit. You earned that knowledge in production, and no model upgrade ships with it preinstalled.

The second is the one that compounds. Because the harness depreciates, the durable skill is not owning the right harness. It is re-deriving it faster than anyone else every time the model moves: knowing which assumptions just went stale, which gates to delete, which to keep, where the new failure surface opened up. That is not an artifact you build once. It is a capability that lives in the team, and it sharpens with every release while everyone still treating the model as the variable is learning the new failure modes from their first outage.

What you actually own

AI made writing code cheap. When the cost of one thing falls that far, the value moves to whatever is now scarce, and what is scarce is everything the cheap part still can’t do for itself: coordination, verification, knowing what to build and proving it works without a human watching. That is a description of a harness, not a better model.

So take the upgrade when the model drops. It is free capability and you should want it. Then go back to the part that was actually keeping your agents out of production, and stop thinking of it as a thing you finish. The model you rent improves on someone else’s roadmap. The harness you own decays on yours, and the work is keeping it from decaying: pulling the gates the new model made redundant, finding the new failure surface before your customers do.

The model is the commodity. The moat is not the harness you have today. It is how fast you can rebuild it when the model moves.

The Real AI Skill Isn't Prompting. It's System Design.

joaofogoncalves — Tue, 26 May 2026 00:00:00 GMT

I sat through an AI training session last quarter. Over twenty people on a video call (the “conference room” was a Zoom grid), a slide deck with the OpenAI logo cropped slightly wrong, a vendor walking everyone through a prompt template for writing emails. Subject: marketing campaign for Q3. Tone: professional, friendly. Paste. Hit return. ChatGPT produced six bullet points. A few thumbs-ups landed in the chat. The vendor moved to slide forty-one.

Everyone left feeling productive.

The training was real, the slides were fine, the prompt template worked. None of it would have taught anyone how to use AI on anything that compounds. The visible surface is the part that sells as a course. The skill is one layer down.

01 — The visible surface

There is a default shape to AI literacy training in 2026. Show the demo. Hand out a prompt template. Have people paste it into ChatGPT and rewrite an email. Maybe a tool tour at the end: here is Claude, here is Gemini, here is your Microsoft Copilot license.

The format is legible. It demos well. It fits a slide deck. You can sell it as a half-day course and book it through procurement.

To be fair: the email-rewriter does work. The summarize-this-meeting prompt does work. Lifting the median knowledge worker on small, well-bounded tasks is a real use case, and the prompt template is roughly the right shape of training for it. The problem is what gets confused with that. The same procurement line item is being sold to engineering leaders, product orgs, and operations heads as if it would do the work that decides whether the company has an AI advantage. It will not. That work is different.

MIT’s 2025 NANDA report put a sharp number on the gap: 95% of enterprise generative AI pilots produced no measurable P&L impact inside six months. The methodology has been contested, and the headline is harsher than the underlying numbers warrant, but even the conservative reads agree that only a third of enterprise AI pilots reach production. The model is not the variable that broke those pilots. The system around the model was.

A prompt template gets you a generic output on a small task. Working with AI on anything that compounds looks different. It looks like breaking a problem into pieces small enough to fail at visibly, knowing what good looks like before the model writes anything, telling it what is wrong with what it gave you, and iterating until it isn’t. None of that is in the slide deck. All of it is now the actual work.

The prompt is the artifact. The skill is upstream.

02 — The system the model operates inside

What people are reaching for when they say “prompting” is, almost always, a different and older skill. It has five rough parts.

Scope. You name the problem precisely enough that a stranger could tell whether the output solves it. Not “help me with the deployment pipeline.” Something like: walk me through each step from commit to production, flag the steps that take longest, suggest where we could parallelize. The first one is a wish. The second one is a question.

Decomposition. You break the work into chunks small enough that the failure modes are visible. The model is going to fail. Your job is to design the work so that the failures are caught, not absorbed. A single mega-prompt that says “build me a marketing site” hides every decision the model is making. Six small prompts each producing a verifiable artifact does not.

Criteria. You write down what good looks like before the model writes anything. In measurable terms. Three bullet points. Two paragraphs of source material cited. No hedging language. No marketing-speak. The rubric is what an evaluation harness checks against — without it, you have no way to know whether a run regressed. If you can’t write the rubric, you don’t know what you want, and the model is going to make a confident guess on your behalf.

Feedback loops. You have a way to tell the next iteration what the last one got wrong. This is the part the templates skip entirely. It is also the part that compounds.

Making the implicit explicit. Every time you correct the model in your head and don’t write the correction down, you are training nobody and nothing. Models are stateless and non-deterministic. They do not learn between conversations the way a contractor learns between engagements. The system around them has to carry what a person would have carried in their head. Put the correction in a system prompt, a skill file, a project doc, anywhere the next run will read it. That is what “system” means in this article. Not orchestration framework. The set of inputs the next run sees.

Five things. None of them are about the model. All of them are about the design around the model.
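The criteria step is the easiest of the five to make mechanical. A minimal Python sketch, assuming a rubric of three bullet points, no hedging language, and no marketing-speak: the word lists and the `check_rubric` name are hypothetical, and a real rubric would be written for the task at hand before the model runs.

```python
import re

# Hypothetical word lists; a real rubric would be tuned to the task.
HEDGES = ("might", "perhaps", "possibly", "it seems")
MARKETING = ("game-changing", "best-in-class", "seamless")


def check_rubric(output: str) -> dict[str, bool]:
    """Score a model output against a rubric written before the run."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    bullets = [line for line in lines if line.startswith(("- ", "* "))]
    lowered = output.lower()
    return {
        "exactly_three_bullets": len(bullets) == 3,
        "no_hedging": not any(
            re.search(rf"\b{re.escape(word)}\b", lowered) for word in HEDGES
        ),
        "no_marketing_speak": not any(word in lowered for word in MARKETING),
    }
```

A dictionary of named results, instead of one overall score, gives the next iteration something specific to be told about.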

This is not new in shape. Scoping, criteria, and structured feedback are familiar from any well-run engineering organization. What is new is what carries the learning. A junior engineer remembers a code review three weeks later. A model does not. Skill files exist precisely because nothing else does the carrying.

03 — The floor and the ceiling

I have been running prompting sessions for over 6 months. The pattern is consistent.

The floor is structural. People who already know how to describe work in measurable, scoped, criteria-bearing terms become functional with almost any model inside an hour. The transferable competence is mostly older than the technology.

A head of support I worked with started a session with “I want to get more insights out of HubSpot.” Fifteen minutes later, with two questions from me, she had written: pull every support ticket from the last 90 days, group them by the product surface the customer was asking about, count tickets per surface, and surface the three themes with the steepest volume growth over the period. She did not learn a prompt. She remembered how to write a brief. She has been writing briefs for vendors and consultants her entire career. The model is just the newest contractor.
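The rewritten brief is precise enough to compute, which is a fair test of whether a brief is scoped. A Python sketch under stated assumptions: tickets arrive as hypothetical `(created_on, product_surface)` pairs, not as a HubSpot export, and "growth" is read as tickets in the later 45 days minus tickets in the earlier 45. The brief itself leaves that definition open.

```python
from collections import Counter
from datetime import date, timedelta


def steepest_growth(
    tickets: list[tuple[date, str]], today: date, top_n: int = 3
) -> list[tuple[str, int]]:
    """Rank product surfaces by ticket growth across the last 90 days."""
    start = today - timedelta(days=90)
    midpoint = today - timedelta(days=45)
    early, late = Counter(), Counter()
    for created_on, surface in tickets:
        if start <= created_on < midpoint:
            early[surface] += 1
        elif midpoint <= created_on <= today:
            late[surface] += 1
    # Assumed definition: later-half count minus earlier-half count.
    growth = {s: late[s] - early[s] for s in set(early) | set(late)}
    return sorted(growth.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```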

The ceiling is something else. It is engineering depth — the ability to look at output and know whether it is wrong. Structural skill gets you a working interaction. Depth is what catches the failure that compiles cleanly, passes the rubric you wrote, and ships your blind spot at scale.

I wrote about this in AI as the Great Filter. The short version: engineering depth used to be a nice-to-have. Now it is the variable that decides who survives the next layer. The engineers who already had the muscle — architecture decisions, code review, design docs, naming a failure mode before someone hits it — compound. The ones who treated cheap AI as a substitute for that muscle ship fluently-broken code at scale.

The trap on the engineering side is the interesting one. Technical literacy looks like depth and is often the opposite. Knowing the stack makes it tempting to skip system design and just write the code yourself. AI made that trap deeper, not shallower. The engineer who used to write a function in 30 minutes now writes it in 90 seconds with Claude. The gap from spec to output is almost entirely verification. If you can’t articulate what good looks like before the model produces it, your verification reduces to “looks right.” That ships your blind spots faster, with more confidence, in larger volume.

A two-hour prompt-engineering workshop teaches the model interface. It teaches nothing about how to know whether the output is wrong. The engineer who could already eyeball a design and say “this won’t work under load” gets faster. The engineer who could not gets a paste-bin of plausible code and a working keyboard.

The floor is structural. The ceiling is engineering depth. Most workforce AI training is failing at the floor and ignoring the ceiling entirely.

04 — aiBerto, the receipt

We have an AI engineering agent at BRIDGE IN. His name is aiBerto. He lives in our Slack, picks the next issue off our project board every couple of hours, opens a PR, runs CI, fixes failing checks, asks for review when he’s stuck. Last month he merged 30 PRs autonomously (zero human commits on his branches), opened 107 GitHub issues, resolved 104, and handled 81 Slack interactions.

He didn’t get smarter last month. The underlying model didn’t change.

He got 40 skill patches.

Forty sentences, written by humans who watched him fail at something specific and figured out what he should have known. Every patch sits in a markdown file in our repo. The next time aiBerto runs the relevant skill, the file gets prepended into his context window. The failure mode stops happening.
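A minimal sketch of that read path, assuming one markdown file per skill with the patch sentences living inside it. The directory layout and the `load_skill` and `build_context` names are hypothetical; the only thing taken from the setup described here is that the file is prepended to whatever the agent is asked to do.

```python
from pathlib import Path


def load_skill(skills_dir: Path, skill_name: str) -> str:
    """Return the markdown for one skill, or an empty string if no file exists."""
    path = skills_dir / f"{skill_name}.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""


def build_context(skills_dir: Path, skill_name: str, task: str) -> str:
    """Prepend the skill file (including every patch sentence) to the task."""
    skill = load_skill(skills_dir, skill_name).strip()
    parts = [skill, task.strip()] if skill else [task.strip()]
    return "\n\n".join(parts)
```

Because the patch is plain text in the repo, adding a sentence is an ordinary commit, and the next run picks it up without any other change.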

A recent one: aiBerto’s /build skill was occasionally merging a PR without explicitly linking it back to the originating GitHub issue. The PR closed the issue via a keyword in the body, which was enough for GitHub but not enough for our project board. A teammate noticed the same failure twice, patched the skill with a one-line “guarantee the PR-issue link before merging” step, and the failure mode disappeared. He did not retrain anything. He wrote a sentence. The sentence is in the repo.

None of those forty sentences are prompts in the popular sense. They are pieces of system. Before merging, link the issue. When CI fails on lint, run the linter locally first. When triaging Sentry, dedupe by stack hash before opening tickets. Each one is a piece of organizational knowledge made explicit, written down, and put on the model’s read path.

The honest question an engineering leader asks here is the regression question: how do you know patching one failure didn’t break three others? Answer: an eval suite that runs the agent against a set of historical issues and PRs before any skill change merges. It catches local regression. It does not catch the harder problem, which is architectural drift across many small clean diffs — the fifth PR through the same area of code surfacing a pattern the previous four missed. The eval handles the failures we already know to look for. The drift is the part we are still writing the system around.
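One way to picture the local-regression half of that answer, as a sketch rather than our actual suite: replay each historical case through the current agent and the patched one, and block the change if any case that used to pass now fails. `HistoricalCase`, the `passes` rubric, and the callable agents are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Agent = Callable[[str], str]  # hypothetical: takes issue text, returns the agent's output


@dataclass(frozen=True)
class HistoricalCase:
    issue: str                     # the original issue text
    passes: Callable[[str], bool]  # rubric: does this output count as correct?


def regressions(baseline: Agent, candidate: Agent, cases: Sequence[HistoricalCase]) -> list[str]:
    """Issues the current agent handled correctly that the patched agent now gets wrong."""
    return [
        case.issue
        for case in cases
        if case.passes(baseline(case.issue)) and not case.passes(candidate(case.issue))
    ]


def skill_change_allowed(baseline: Agent, candidate: Agent, cases: Sequence[HistoricalCase]) -> bool:
    """Gate for a skill patch: no previously passing case may start failing."""
    return not regressions(baseline, candidate, cases)
```

Comparing case by case rather than by overall pass rate matters here: a patch that fixes one case and breaks another keeps the average flat and still regresses.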

The model is the same. The system around it got better. The agent’s effective intelligence rose because the people closest to each kind of failure added the missing piece of context. That is system design. It is also what AI literacy looks like at scale.

The longer version of this story has the metrics. The shape that matters is this one: forty sentences moved more output than any prompt-template handout ever could.

05 — What a real curriculum looks like

If I were running an honest AI literacy program for an engineering organization, the agenda would look almost nothing like the one I sat through.

Pick a real problem the team actually has. Not a sample case study, not a “draft this email” exercise. A real Sentry bucket. A slow code review queue. An incident post-mortem nobody has written yet.

Before anyone opens a model, write down what done looks like. In measurable terms. The same rubric you would put on a sprint planning ticket. Three bullets. Pass criteria. A test you could run.

Break the problem into the smallest unit the AI can fail at visibly. If the unit is too big to verify, it is too big to delegate. Split it. The unit looks a lot like a small PR.

Write the evaluation rubric yourself. Not the one the model suggests. The model will suggest a flattering rubric. That is what models are tuned to do.

When the output is wrong, write the missing sentence the next run would need to read in order to not be wrong in the same way. Put it in version control. Now you have a system. Now you also have a regression target — the next change runs against the same sentence and the team sees whether it still holds.

Do this for four weeks. The deliverable is not a slide deck. It is a markdown directory of sentences a team’s worth of people wrote because they watched something fail and figured out what it should have known. That directory compounds. The slide deck does not.

The engineering-org version of this argument is adjacent. Bolting Copilot onto a workflow is not AI-first. Bolting prompt templates onto a workforce is not AI-literate. Both confuse the artifact for the work.

Back to the call

The twenty-plus people on that call are not at fault. The vendor is not, mostly, dishonest. The slide deck and the prompt template are not wrong for what they’re trying to do.

The honest version of the same training would have looked almost nothing like it. The vendor would have opened the team’s actual bug tracker. Pulled the next issue. Asked someone what done looked like, in measurable terms, before anyone touched a model. Made them write the rubric. Run the model. Watched it fail at something specific. Asked someone else what the missing sentence was. Committed the sentence to the repo. Run it again. Saved the rubric, the failure, and the sentence as the artifact of the session.

The deliverable would have been a markdown file the team kept, not a slide deck the team forgot.

The skill is older than the model. The model is just the latest reason to teach it.

Building the Road to Production, Again

joaofogoncalves — Wed, 20 May 2026 00:00:00 GMT

The motto

In 2022 someone interviewed me about being Head of DevOps at Valispace. I said the team motto out loud, probably for the first time on camera: “building the road to production.”

We didn’t build the product. We built every tool the team needed so the road from idea to production stayed clear, fast, and safe. Three people, 400 customer deployments, a release every two weeks, four hours to roll the new version across the entire customer base on a Sunday night. Later in the interview I said something that still describes the work: my work is mostly well done when nobody talks about me.

I’m doing the same job again at BRIDGE IN. The road has different traffic on it now.

The current playbook for adopting coding agents treats agentic engineering as a new discipline. The shape mostly isn’t new. The SDLC patterns that scaled a 3-person DevOps team running 400 cloud deployments transfer cleanly to a small product team running an agent fleet. The unit of “builder” changed. The work — mostly — didn’t.

Mostly. Four patterns map. One assumption underneath them breaks, and the break is the whole point of this piece. Teams building the four without confronting the fifth are paying a tax that won’t show up in any dashboard they own for another two quarters.

One control plane

The internal tool I spent 80% of my time on at Valispace was something called ValiAdmin. It was the single place where the company described itself to itself. Customer deployments, internal dependencies, server health, version state, configuration. One interface, managed by code, queried by the rest of the team.

When customer success needed to spin up a deployment, they didn’t page DevOps. They opened ValiAdmin. When support needed to restore a backup, they opened ValiAdmin. When a release went out at 9pm on Sunday, the orchestrator was ValiAdmin.

Three people managed 400 deployments on ValiAdmin.

The general pattern: the system has to describe itself in machine-readable form, in one place, in a way that anyone — or anything — operating it can read and act on. Otherwise the institutional knowledge lives in three engineers’ heads and the team can’t grow past that.

That pattern got new vocabulary in 2026: CLAUDE.md, AGENTS.md, rules files, MCPs, custom skills. Those are the informational half — the documentation an agent reads to ground itself in your team’s standards. They map onto the rules and runbooks ValiAdmin’s UI made obvious. They’re necessary and they’re not sufficient.

The operational half (the thing that knows the current state of every service, every container, every secret, every dependency, and can act on it) is the part most teams haven’t built. BridgePort, the deployment tool I built recently at BRIDGE IN, is the agent-era equivalent of that operational half. One instance manages every environment, every server, every service, every container image, every encrypted secret, every config file. Deployment plans resolve service dependencies, deploy in order, verify health checks between steps, and automatically roll back the whole chain if anything fails. The state of what’s running, what version, where, and who deployed it is queryable from one UI and from one MCP server, which means a Claude Code agent can answer “what changed on staging in the last hour” with the same authority as the on-call engineer.
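The deployment-plan behavior can be sketched in a few lines. This is not the tool's implementation, only the shape of the idea under stated assumptions: `depends_on` maps each service to the services it needs first, and `deploy`, `healthy`, and `rollback` are hypothetical callbacks supplied by the caller.

```python
from graphlib import TopologicalSorter
from typing import Callable, Mapping, Sequence


def run_plan(
    depends_on: Mapping[str, Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> list[str]:
    """Deploy services in dependency order; undo the whole chain if a check fails.

    Returns the services deployed, in order. Raises RuntimeError after
    rolling back everything deployed so far when a health check fails.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    deployed: list[str] = []
    for service in order:
        deploy(service)
        deployed.append(service)
        if not healthy(service):
            # Roll back in reverse order so dependents are undone before dependencies.
            for done in reversed(deployed):
                rollback(done)
            raise RuntimeError(f"health check failed for {service}; plan rolled back")
    return deployed
```

The health check sits between steps on purpose: a failure stops the plan before anything that depends on the broken service is touched.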

The split matters. CLAUDE.md tells the agent what good looks like. The control plane tells it what’s actually true. Most teams I see have invested heavily in the first and almost not at all in the second. The result is agents that know the standards but can’t observe the state, and therefore confidently hallucinate it.

Self-service for the rest of the team

The companion to ValiAdmin was Valinstall. It was how customers who insisted on on-premise deployments installed Valispace themselves. The goal was simple: get from “we send an engineer to your data center for two hours” to “you run one command and read a README.”

We got it to about 80%. The reason it mattered: some customers literally could not let us see their data. ITAR-regulated defense customers ran Valispace inside their own networks with no visibility back to us. When something broke, we were, as I put it in the interview, fixing a car’s engine through the tailpipe.

The only viable answer was tooling. Better diagnostics they could run themselves. Better installers. Push the surface outward so they could solve their own problems with our tools instead of waiting on us to teleport in.

The agent-era version of this pattern is shaped the same way but aimed at a different audience. Valinstall pushed surface to customers we couldn’t see. Skills, MCPs, and rules files push surface to internal teams an engineer used to chaperone — a PM opening a properly-scoped issue without a roadmap meeting, support triaging a bug report against the codebase, an agent building the boring CRUD page without a senior engineer in the loop.

The audience is different. The shape is identical: the small team scales by widening the surface where other people (or other agents) can act without it. You don’t scale a small team. You scale the surface.

Otherwise the bottleneck just moves.

The managed boundary

In 2022 I had a clear rule for AWS services: use the managed version unless we found a specific reason not to. The DevOps team had 400 deployments to keep running. We couldn’t afford to also be the DBA team.

The same heuristic applies to AI infrastructure in 2026, but the question has shifted under it. Self-host vs. hosted was the 2024 framing. By 2026 the real CTO-level question is which abstraction you commit to: single-vendor hosted (Anthropic, OpenAI direct), multi-provider routing (a fallback-aware, cost-aware layer on top), or platform abstraction (Bedrock, Vertex). All three are “managed.” The decision is governance posture, not build-vs-buy.

The silent constraint underneath all three is cost-per-team accounting. Most orgs in 2026 don’t have token budgets that map to teams. They have a single bill that grows quarter over quarter and nobody owns it. That’s the same DBA-team problem in new clothing — the thing you delegate still needs an owner, even when the owner is delegating the implementation.
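A sketch of what per-team accounting could look like at its smallest, assuming usage events arrive tagged with a team name. The `TokenLedger` class and its budget figures are hypothetical; a real version would sit on the provider's usage data and page someone instead of returning a flag.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    budgets: dict[str, int]  # tokens allowed per team for the current period
    spent: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def record(self, team: str, tokens: int) -> bool:
        """Add usage for a team; return True when that team is now over budget."""
        if team not in self.budgets:
            # Spend with no owner is exactly the problem, so refuse it loudly.
            raise KeyError(f"no budget owner for team {team!r}")
        self.spent[team] += tokens
        return self.spent[team] > self.budgets[team]
```

The interesting line is the `KeyError`: usage that maps to no team is rejected rather than silently added to a shared bill.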

The carve-outs for going custom (self-hosted weights, your own GPUs) are real and narrow. Data residency. Latency or cost ceilings no hosted offering hits. ITAR-style isolation that won’t ever cross a public-cloud boundary. Everything else is managed, and the engineering decision is the layer above it.

This decision is downstream of one most teams haven’t answered: “AI-first” or AI-assisted? You can’t pick the right inference layer until you’ve decided what shape of work the agents are actually doing.

The coaching layer

The rules I had at Valispace were short. If you do something twice, you automate it. If you don’t use something for over a year, you drop it. Nothing on a personal laptop, everything in version control.

I didn’t enforce those rules in code reviews. I didn’t need to. The tooling made the rules cheap to follow. The CI made deviation expensive. After six months on the team, an engineer wouldn’t think to do it the old way.

That’s the part most engineering leaders miss when they talk about “AI adoption.” They treat it like a training problem. Send the team to a workshop, show them the prompts, hand them the keyboard, expect agentic engineering to materialize. It doesn’t. The reps don’t compound through content.

What does compound: putting the rules into the road itself, as enforcement, not as documentation. The CLAUDE.md that the reviewer agent is wired to reject diffs against. The pre-commit hook that runs the eval suite on agent-generated changes. The MCP that exposes the bug tracker so the agent grounds an issue in your team’s actual taxonomy instead of inventing one. Guardrails as code, not training as content.

The team learns the rules by living inside them. Same way Valispace developers learned the release process — by working in ValiAdmin, watching it run, occasionally asking why a check fired. Coaching wasn’t a session. It was the workflow.

I did this once in someone else’s department. Post-acquisition at Altium, I inherited an engineering org that had just doubled in size. The leads thought their friction was tooling and headcount. It wasn’t. The new half of the team didn’t have a road to operate on: no enforcement, no shared scaffolds, every engineer reinventing the basics. I spent the next year building one. Ninety percent of the team stayed through the merger, most of them with shareholder cash on the table to leave.

You can’t upskill a 100-engineer org with a workshop. You upskill it by making the right behavior the path of least resistance.

The one thing that doesn’t transfer

Valispace shipped a release every two weeks. Forty-plus releases a year. 400 customer deployments updated in 4 hours on a Sunday night. Automated rollback if a deployment failed. Migration scripts dry-run on staging clones first. Dozens of checks running before the deploy button was even available.

The headline metric the founder asked about was speed. The actual metric we tracked was the cost of failure. If a release broke production, what was the blast radius, and how fast could we recover? Two-week cycles were possible because failure was cheap — cheap because we’d built every fail-safe into the path before we built the path itself.

The agent-PR loop wants to be the same shape. The unit got smaller, the cadence went from weeks to hours, but the discipline transfers: every PR an agent opens is a small unit of change with automated tests, automated review, automated revert. The teams I see struggling with agent-generated code are almost always the teams that didn’t have a cheap-failure SDLC before agents arrived. They were paying a bug-bash tax already. The agents just scaled it.

That much rhymes. Here’s the part that doesn’t.

A bad deploy is legible. The alert fires, the rollback runs, the blast radius is the customer base, recovery is measured in minutes. The Sunday-night release that broke would be visible by Monday morning. Cheap because detectable.

A bad agentic pattern is illegible. The PR passes lint and tests. The reviewer agent approves. The diff merges. Six months later you’re untangling an architectural drift you can’t trace to a single decision, because no single decision caused it — it accumulated across hundreds of PRs that each looked locally fine. The metaphor I’d use now: it isn’t a fire. It’s mineral deposit in the pipes. By the time anything is restricted enough to notice, the system has been quietly degrading for two quarters.

The “cheap failure” doctrine assumed failure was detectable. That assumption is the part of the discipline that breaks.

The deeper reason is that the unit of work shifted from deterministic to non-deterministic. ValiAdmin was a deterministic control plane: same input, same state, same output. Terraform plans diff, infrastructure converges, rollbacks are exact. Agent code generation isn’t deterministic. Same prompt, different output. Same review criteria, different judgment from one run to the next. Same MCP, different decisions about which tool to call. The variance is baked in. You can clamp it. You can’t remove it.

So the architecture has to invert. The old discipline was: instrument the path so any single failure is detectable and recoverable. The new discipline has to be: instrument the pattern-space, because no individual diff is the failure — the failure is the cumulative drift of hundreds of diffs that each looked locally fine. The old detection layer found regressions. The new one has to find trends across diffs that don’t individually regress anything.

Concretely, the agent-era detection layer has at least four parts most teams haven’t built:

Architectural-drift evals. Not unit tests. Periodic structural checks against the dependency graph, AST-level pattern adherence, abstraction-leak detection. The question they answer isn’t “does this PR work” but “is the codebase getting worse, slowly, in a direction nobody chose.”
Trace correlation across agent handoffs. The most expensive failures span a planner agent, an implementer agent, and a reviewer agent. None of them individually fails. The handoff loses context and the output is wrong by a step nobody owns. OpenTelemetry-style correlation, but for multi-agent loops.
Token-spend budgets as first-class production constraints. Not finance reports after the fact. Real-time per-team budgets that page when they’re breached, the same way you’d page on infra cost. The teams without this have a single corporate AWS-style bill that grows quarterly and nobody owns.
Pattern-level regressions in the eval suite. Not “does this answer match the golden?” but “is the model getting worse at the kind of judgment we care about?” Run on a corpus, scored on direction not on individual answers, treated as a release gate.
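To make the first and last of those concrete, here is a rough sketch of a trend check, assuming you already extract import edges between layers at each snapshot of the codebase. The layer names, the `forbidden` set, and the `max_slope` threshold are illustrative assumptions, not a recommended configuration.

```python
from typing import Sequence


def count_violations(imports: Sequence[tuple[str, str]], forbidden: set[tuple[str, str]]) -> int:
    """Count import edges (from_layer, to_layer) that cross a forbidden boundary."""
    return sum(1 for edge in imports if edge in forbidden)


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (change per snapshot)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two snapshots to measure a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_drifting(violations_per_snapshot: Sequence[int], max_slope: float = 0.5) -> bool:
    """True when violations grow faster than max_slope per snapshot on average."""
    return trend_slope(violations_per_snapshot) > max_slope
```

The gate is on the slope, not on any single snapshot, so a run of diffs that each add one small violation still trips it even though none of them would fail a per-PR check.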

The metrics most teams track for the agent loop are lead time and deploy frequency, the DORA staples. They measure the easy half. For the agent-PR loop specifically, the real constraint sits upstream of both: whether the failures the loop produces are the ones the loop can see.

The Valispace cycle worked because the failure was loud. The agent cycle works only when the failure is made loud by design.

What the road has to do now

Doing something twice in 2022 meant writing a Bash script or a Python tool. Doing something twice in 2026 means writing a skill, a rules file, an MCP, an agent instruction. Same impulse. Different surface. I built BridgePort because I’d done a manual deploy more than twice. I write Claude Code skills for the same reason.

The discipline rhymes with what it was. The failure model is the part that doesn’t survive intact.

The Valispace road made deployment safe enough to run on a two-week cycle. The 2026 road has to do more. It has to make non-determinism legible. Detectable across hundreds of PRs that each looked fine in isolation. Detectable across agent handoffs that span environments. Detectable when the diff passes every gate you built and still degrades the system by inches.

If you run engineering, the move tomorrow morning isn’t to hand out more Copilot licenses. It’s to audit your failure-detection layer and ask whether it was built for legible failures or illegible ones. If it was built for the old shape — alerts on broken deploys, regressions on shipped diffs, dashboards that show the last hour — you’re under-instrumented for the cycle you’re already in. The cost will show up six months from now, in a backlog of architectural drift you can’t trace to any single decision.

Same building discipline. New thing the road has to carry.

My work is mostly well done when nobody talks about me. That part hasn’t changed.

Lead Time Is the Wrong Half

joaofogoncalves — Thu, 14 May 2026 00:00:00 GMT

The wrong half

Most engineering teams measure how fast a PR ships. Almost none put it on the same slide as how fast a user complaint becomes a deployed fix.

The first number is the easy half. The second is the one customers feel.

I’ve seen the easy-half metrics circulate on our team and on every other team I’ve worked with. Three hundred and thirty-four PRs in a month. A twenty-two-minute mean lead time. Ninety-eight percent of PRs merged in under a day. Found-a-bug-yesterday, fixed-it-today. All real, all flattering, all measuring the half that begins after the work has already been noticed, scoped, prioritized, and started.

Report to deployment is what’s left after you cut that half off. Ours last month was 18.6 hours from issue filed to in production, across 78 feedback-sourced issues. The DORA-style lead time on the same work was 22 minutes. Two clocks on the same loop, fifty times apart.

We built ours at a tenth the size

Shopify has been writing about an AI agent it built called River. Tobi Lütke’s post on it is the clearest description I’ve read of the design choice we made too. River lives in their company Slack. She doesn’t take direct messages. If you DM her, she politely asks you to open a public channel and try again. The thing I keep coming back to in their writeup is the constraint: every conversation River has is searchable, watchable, and editable by anyone in the company. About one in eight pull requests merged into their main codebase last week was authored by her.

I read about it the same week our team was halfway through building ours.

Ours is called aiBerto. He has a hoodie, a moustache, and headphones.

He lives in our Slack. He picks the next bug off our project board every couple of hours, builds the fix, opens a PR, runs CI, fixes failing checks, asks for a human review when he’s stuck. He triages Sentry. He reads our user-feedback channels, files GitHub issues for the unhandled threads, replies in the thread when the issue lands. He does not have a DM inbox.

We are not five thousand people. We are fewer than ten.

Same constraint, scaled down.

What the loop actually does

A user complains in our feedback channel. aiBerto reads the thread, decides whether it’s a real issue, asks a clarifying question if it isn’t sure, opens a GitHub issue when the picture is clear. The issue lands on the project board. /pick claims the highest-priority one and announces what it’s working on in the team channel. /build writes the code, opens a draft PR, runs the tests, fixes the CI failures, marks the PR ready when it passes. A human reviews it. It merges.

Every step lives in a public channel. If /build gets stuck on a CI failure it can’t recover from, it pings the engineering channel and walks away. If a Dependabot security update lands overnight, aiBerto either resolves it autonomously or files a ticket and escalates. If a teammate @-mentions him in a thread, he answers in the thread, with the context from the thread, never from a private working memory.

The receipt at the end of the month is one git history and one Slack archive. The two together are the operating manual.

The other half

DORA gave most engineering orgs a default vocabulary for velocity: lead time for changes, deployment frequency, change failure rate, mean time to restore. The first one is the one everyone reaches for.

That number can look great while the company moves slowly.

Twenty-two minutes is a real measurement. We hit it. But the clock starts at the first commit, after a piece of work has already been scoped, prioritized, picked up, and written. Everything before the commit (somebody noticing the bug, somebody else reproducing it, a third person deciding it’s worth fixing, a fourth person writing the issue, a fifth person finding time to start) is not in the number. In most teams, that prelude is where the days disappear.

Two teams can hit the same lead time and have wildly different report-to-deployment numbers. The first triages every user report inside an hour, runs an on-call, has a single shared queue. The second sees a Sentry alert in the morning and circulates a Slack message asking who should own it. Both teams report a 22-minute lead time. The first gets from report to production in six hours. The second takes six days.

This isn’t a metric I’m inventing. It’s closer to what Mik Kersten calls Flow Time in the Flow Framework: the elapsed time from an idea becoming legible to it being delivered, including the planning and waiting and triage that DORA’s lead time intentionally excludes. Industry-average Flow Efficiency sits around 15 to 25 percent. Work items in a typical team spend 75 to 85 percent of their lifetime waiting.

Report to deployment is Flow Time aimed at the bug-and-improvement loop, with the clock starting at the first external signal: a Sentry exception, a customer message, an internal Slack thread. It stops when the fix is in production. It counts triage, scoping, prioritization, implementation, review, and release. It’s the number the customer actually feels.
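Both measures are simple enough to compute from timestamps you probably already have. A minimal sketch, assuming each issue carries the time of its first external signal and the time its fix reached production; the function names are mine, not part of any framework.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable


def report_to_deployment(first_signal: datetime, in_production: datetime) -> timedelta:
    """Elapsed time from the first external signal to the fix reaching production."""
    if in_production < first_signal:
        raise ValueError("deployment cannot precede the report")
    return in_production - first_signal


def median_hours(durations: Iterable[timedelta]) -> float:
    """Median duration in hours across a set of issues."""
    return median(d / timedelta(hours=1) for d in durations)


def flow_efficiency(active: timedelta, total: timedelta) -> float:
    """Share of elapsed time spent on active work rather than waiting."""
    return active / total
```

With the figures from earlier in this post, `timedelta(hours=18.6) / timedelta(minutes=22)` comes to roughly 50.7, which is the gap between the two clocks.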

You don’t shrink it by shipping faster.

You shrink it by not stalling at the seams.

What compounds in public

The most interesting thing about running the agent in a public channel isn’t the PRs he ships. It’s the way the team gets better at running him.

aiBerto’s skills (/pick, /build, /heartbeat, /fix-dependabot, /ship, /changelog) are Claude Code custom skills: markdown files in our repo. Forty of them changed in the last month. Some got merged into others, some got deleted entirely. Most got patches because someone watched aiBerto get stuck on a specific failure mode and wrote down what he should have known.

A recent one: aiBerto’s /build skill was occasionally merging a PR without explicitly linking it back to the originating GitHub issue. The PR closed the issue via a keyword in the body, which is enough for GitHub but not enough for our project board. A teammate noticed the same failure twice, patched /build with a “guarantee the PR-issue link before merging” step, and the failure mode disappeared. He didn’t retrain anything. He wrote a sentence.

The sentence is in the repo now. Every future aiBerto run reads it.

My job used to be triaging Sentry. Now my job is writing the sentences aiBerto needs to triage Sentry himself.

This is the compound effect Shopify keeps describing in their public writeups about River. The agent’s average correctness rises not because the model changed, but because the people closest to each kind of work added the missing piece of context. He gets better at being our team.

The shape works at five thousand engineers and at fewer than ten. What I’m less sure about is the middle. At 200 engineers, three time zones, and compliance constraints, “the public channel” stops being one channel and becomes a permission graph: legal can’t see customer-PII threads, the embedded fintech team can’t put session tokens in a channel that contractors join, and the review queue becomes its own chokepoint regardless of how short the front half gets. The mechanism that compounds — single shared surface, low-friction triage, willingness to make the work visible — probably scales. The single shared surface itself does not. That’s where the redesign lives.

If aiBerto were a private window (a DM, a Cursor session, a chat sidebar), only one person would learn from any given interaction. The next teammate hitting the same friction would have to discover the patch from scratch. Most of the “AI productivity gain” you read about is shaped that way: an individual gets faster, the org doesn’t.

Public agents move the gain to the organization. Private agents keep it on the individual.

The receipts

For the month between April 6 and May 4, aiBerto:

merged 30 pull requests, all of them autonomously, with zero human commits on his branches
opened 107 GitHub issues from Sentry triage and user-feedback channels
resolved 104 issues, mostly the ones he’d opened himself
handled 81 Slack interactions across product and engineering channels
processed 68 Dependabot updates, merging the safe ones and escalating the rest
had his own skill set patched 40 times by the humans watching him work
cost something on the order of $500 in API spend

Across the whole team that month, the median time from issue filed to in production was 18.6 hours, measured across 78 feedback-sourced issues. That’s the report-to-deployment number, well under the 24-hour bar that DORA’s pre-2025 reports used as the elite threshold for the much narrower lead-time-for-changes metric. Ours counts much more of the loop. Our engineering team is three humans plus the agent. The three humans shipped 69 issues that month. aiBerto’s 30 PRs landed in parallel, every one authored autonomously, with zero human commits on his branches.

A few of those numbers are flattering against industry benchmarks. The twenty-two-minute mean PR lead time clears any reasonable elite threshold. A hundred-percent autonomy rate on merged PRs is uncommon on the public record. Eighty-seven percent label coverage on PRs and ninety-nine percent on issues means almost everything is searchable along the dimensions you’d expect.

A few are honestly not. The 30% CI first-pass rate is the number I find most useful, and it isn’t flattering. It says aiBerto fails the first CI run roughly seven times out of ten and then fixes himself before asking for help. Compute is dramatically cheaper than human context-switching, so I read it the other way: he hits the friction the linters and type-checkers were going to catch anyway, he hits it before a human reviews, and the cost shows up in API tokens rather than in stand-ups. The loop closes either way.

The 18.6-hour number also lives downstream of a triage step the agent runs. aiBerto reads each Sentry alert and Slack thread and decides whether it’s a real issue before opening one. That filter probably tilts the denominator toward well-formed work. To falsify it I’d sample raw signal traffic for a month and re-measure. We haven’t.

There’s also a list of numbers we don’t publish that we probably should: a clean change-failure rate, mean-time-to-restore, Flow Efficiency (the ratio of active work to wait time, which would show us where in the loop we’re actually stalling). The receipts above are the work that fell out of the loop, not a full audit of the loop’s quality.

The other thing those numbers don’t capture is the rhythm. aiBerto starts work every couple of hours, around the clock. The merges cluster during business hours because humans approve them. The PRs that wait for review until morning are not the bottleneck.

The bottleneck is review.

That moved sooner than I expected. With one human reviewer per PR and the agent generating roughly seven PRs a week, a queue forms inside two days if nothing else changes. DORA’s 2025 State of AI-Assisted Software Development report found that as individual productivity rises with AI adoption, organizational delivery metrics — lead time, deployment frequency, change failure rate — often stay flat or get worse. The review queue is one of the places that gap opens up.

Three things changed in our review practice as a result. First, aiBerto opens smaller PRs by construction: one issue, one change, no opportunistic refactors. The diffs read fast. Second, low-risk categories — label-only changes, copy fixes, fully-covered backend patches with green CI — merge under our auto-merge policy without human review at all. Third, when a human does review, the question shifts. It’s no longer “did you write good code.” It’s “is this the change we wanted.” That’s a different kind of review, and it doesn’t get easier.
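The second change is the easiest to write down as code. A sketch of such a policy, with the category labels and the `PullRequest` fields as hypothetical names; the single-linked-issue condition is my assumption about how "one issue, one change" might be encoded, not a description of the actual rule.

```python
from dataclasses import dataclass

# Hypothetical labels for the low-risk categories named above.
LOW_RISK = {"label-only", "copy-fix", "covered-backend-patch"}


@dataclass(frozen=True)
class PullRequest:
    category: str       # label assigned when the PR is opened
    ci_green: bool      # every required check passed
    issues_linked: int  # number of issues this PR closes


def can_auto_merge(pr: PullRequest) -> bool:
    """Allow merge without human review only for low-risk, green, single-issue PRs."""
    return pr.category in LOW_RISK and pr.ci_green and pr.issues_linked == 1
```

Anything that falls outside the set goes to a human by default, so widening the policy is an explicit edit to `LOW_RISK` rather than something that happens by accident.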

What we haven’t solved is architectural drift. The agent ships a clean diff that solves the issue but misses a refactor a senior engineer would have caught. Each individual PR is fine. The fifth PR through the same area of code is the one that surfaces the pattern, and by then the previous four are merged. That’s the part we don’t have a metric for yet.

Back to the channel

The easy-half metric was always going to flatter us. A team can hit an elite lead time and ship at the speed of monthly stand-ups, and nobody asks why. The number doesn’t budge.

Report to deployment is the metric that breaks if any seam in the company is slow.

When the agent runs in a public channel, there’s no second story to tell about how engineering is going. The Slack history is the work history. The PR diffs are the daily output. The skill changelog is the management ledger. They are what’s left when the work has nowhere else to hide.

Pick one metric. Pick the one that includes the part of the loop you usually have to make excuses for.

The agent doesn’t need a dashboard.

It needs a channel.

AI as the Great Filter

joaofogoncalves — Mon, 04 May 2026 00:00:00 GMT

The filter is already running

In March, Amazon’s retail site went down twice in three days. The second outage knocked U.S. order volume down by 99% for about six hours. Roughly 6.3 million orders never landed. Both incidents were traced to AI-assisted code that shipped without proper review.

A few months earlier, Amazon’s own AI coding agent, Kiro, had been asked to fix a bug in the Cost Explorer service. Its proposed fix was to delete the production environment and rebuild it. The action got approved. Cost Explorer in mainland China went down for thirteen hours.

Amazon’s official position on all of this is that these were user errors, not AI errors — misconfigured permissions, missed reviews, the same mistakes any tool can amplify. That framing is convenient, but it’s also probably right, and it doesn’t make the outcome any better. Whether you call it an AI failure or a governance failure, the pattern is identical. An action got proposed. The review didn’t catch it. The blast radius was bigger than anyone could undo.

Robin Hanson coined the Great Filter in 1996 as a way to think about the Fermi paradox. If life is common, why is the universe quiet? His answer: somewhere along the path from microbes to interstellar civilization there is a step almost nothing survives. Maybe it’s abiogenesis. Maybe it’s intelligence. Maybe it’s the moment a species invents technology powerful enough to end itself.

I keep thinking about a smaller version of it. Not for civilizations. For engineers, teams, and companies.

A filter that’s already running. Quietly. Right now.

The slot machine in your IDE

A few weeks ago I wrote about BullshitBench, a benchmark that feeds language models false premises wrapped in confident-sounding jargon and measures how often they push back. Claude Sonnet 4.6 pushed back about 90% of the time. Most of the others were close to a coin flip.

That number is funny until you remember what it means in practice. Most of the major coding assistants engineers use daily will agree with a wrong assumption about half the time rather than correct it.

This isn’t a bug. It’s a training artifact. Anthropic’s 2023 paper Towards Understanding Sycophancy in Language Models showed that human raters, given the choice between an accurate response and a flattering one, often pick the flattering one. Preference-based training inherits the bias. The 2025 SycEval paper found that across major frontier models, sycophantic flips — where the model abandons a correct answer once a user pushes back — are common and predictable. A 2025 medical-domain study found that GPT-4 and GPT-4o complied 100% of the time with prompts designed to elicit logically inconsistent drug advice. The best-refusing open model still failed to push back about half the time.

The model isn’t lying. It’s optimizing. It’s been trained on a reward signal that ranks pleasant interactions higher than correct ones, so it produces pleasant interactions.

What dumb mistakes look like at AI speed

There’s a phrase I heard in an internal Slack thread last week, half-joke and half-warning. Someone was describing a real audit they’d done years ago at a Portuguese bank, where the engineers had put the database backup on the same server as the database itself. The auditor asked the obvious follow-up. What about a fire? What about both rooms burning? Crickets.

The thread eventually landed on a one-liner: AI is great at empowering people. Including the wrong ones, to ship the wrong things faster.

That isn’t a dunk. It’s an architecture observation.

The backup-on-the-server class of mistake didn’t appear with AI. It has been around as long as we’ve had servers. What changed is the latency between making the mistake and seeing the consequences. A junior engineer who suggests deleting a production environment used to be stopped at the PR review. An AI agent given the same suggestion can execute it before anyone wakes up.

A 2025 survey found that 43% of AI-generated code changes required debugging in production. Not in staging. In production. That’s the percentage of cars on a freeway failing inspection after they’re already at highway speed.

Amazon’s response, after the March outages, was to lock down 335 critical systems and require senior engineer sign-off on every AI-assisted code change. That’s the right reaction. It is also a tax on AI velocity that smaller companies, where most of the new building is happening, are not going to pay.

So they will skip it. Until they don’t.

The gambling loop

Gambling addictions don’t develop on losses. They develop on intermittent wins. Slot machines are a clean implementation of variable-ratio reinforcement, the schedule that produces the most extinction-resistant behavior in animal studies. The pattern keeps you pulling.

AI coding tools have the same architecture, accidentally. Sometimes the output is junk. Sometimes it’s a working function. Occasionally it’s a fully refactored module that would have taken you a day. The wins get cached as “I am now an engineer who ships features in an hour.” The losses get rationalized — wrong context, bad day, my fault for the prompt. The reward signal stays positive even when the average outcome doesn’t. By the time the codebase has accumulated a pile of half-understood abstractions, the bill is already due.

The strongest objection

The best version of the counterargument goes like this. A junior engineer in 2026 working with Sonnet 4.6 has access to a Socratic tutor that breaks down code, walks through tradeoffs, and answers “why?” indefinitely. That is better feedback than most of us got from anyone except our best Staff engineer in 2014. So the depth gap closes, not widens. The filter doesn’t sort against juniors. It accelerates them.

I think this version is partly right. AI is a phenomenal learning aid for engineers who already know how to learn from one. But the failure mode in the data isn’t “AI substitutes for depth.” It’s that sycophancy kicks in hardest when the user can’t push back. A junior who treats the model as a Socratic partner — who asks “why?”, who runs the suggestion mentally before accepting it, who notices when an explanation is just confident-sounding noise — gets the tutor. A junior who treats the model as an oracle gets the slot machine. The deciding variable is what the human brings to the loop. That’s true of textbooks, Stack Overflow, and every senior engineer who tried to mentor anyone. The new thing isn’t the tutor. It’s that the worst-case mode now ships to production at machine speed.

What survives the filter

The engineers who come out of this era ahead are not the ones using AI less. They’re the ones using it harder, with their hands on the wheel.

There’s a pattern I see in the people I trust most on this. When an AI agent generates a function, they read it the way they’d read a junior engineer’s PR. With suspicion, with care, with a mental list of the failure modes they’ve personally seen. They notice when the test coverage is theatrical. They notice when the abstraction is too clean for the problem. They notice when the model is confidently wrong in the first paragraph and the rest of the answer is just downstream of that.

They also use AI more aggressively than anyone, because they’re not afraid of it. They know what to throw out. They know what to keep.

The depth premium compounds. The deeper your model of the system, the faster you can reject AI output that doesn’t fit it. The faster you reject it, the more iterations you run. The more iterations you run, the more real value you extract. None of it works if there’s no model to start with.

For the engineer who never built that model, the loop runs in reverse. AI generates plausible-looking code. They can’t tell if it’s wrong. They ship it. It works in staging. It breaks in production. They ask the AI to fix it. The AI generates a plausible-looking patch. They ship that too. Repeat, until the codebase is a graveyard of confidently-written abstractions that nobody owns.

The filter is selection, not extinction

Hanson’s Great Filter is a probability barrier. The version running on engineering teams right now isn’t an extinction event. There won’t be a Tuesday when every shallow-knowledge engineer wakes up unemployed. The selection happens in the gap between two trajectories.

One: the engineer who treats AI as a force multiplier on depth they already have. They get faster. Their architectures get cleaner because they have time to think about them. Their reviews get sharper because the AI handles the boilerplate. Six months in, they’re shipping work that used to take a team of three.

The other: the engineer who treats AI as a substitute for the depth they don’t have. They also get faster. But their codebase accumulates hidden cost. Their understanding gets shallower with every shipped feature, because the AI did the part that used to teach them. Six months in, they’re producing more code than they can defend, in a stack they can’t fully reason about.

For a while, you can’t tell them apart. They’re both shipping. Their managers see green dashboards. The metrics look fine.

Then something breaks that requires real understanding to fix. And only one of them can fix it.

The engineers letting AI run unchecked aren’t getting filtered today. They’re being selected against. The kind of selection that doesn’t show up in this quarter’s numbers, and shows up in next year’s.

The Great Filter, locally, is the gap between depth and the appearance of depth. It’s been running the whole time. AI just turned up the speed.

Experience Isn't the Tax. Identity Is.

joaofogoncalves — Tue, 28 Apr 2026 00:00:00 GMT

What Jaya got right

There’s a piece going around called “Experience is now a tax.” It went up a couple of days ago and has been doing the rounds. Jaya Gupta wrote it. If you haven’t read it, the setup is this. Somewhere right now there’s a CIO at a Fortune 500 firm who has never opened Claude, can’t explain what a Claude skill is, and still asks his reports to print documents and leave them on his desk. He is the person making the decision about his firm’s AI ROI.

Meanwhile, a 22-year-old is writing production code in an afternoon and turning a napkin sketch into a working prototype before lunch.

These two people are having different experiences of the same technology, and the cultural conversation hasn’t caught up. The senior cohort defends its seat with two words: judgment and taste. Things AI cannot replicate. Things that take decades to develop. Things, conveniently, that the person making the argument has spent a career accumulating.

Jaya’s argument is that AI just collapsed the cost of three decision-making algorithms the brain runs constantly. Trying something new versus sticking with what works used to be expensive. Carrying knowledge in your head versus offloading it used to be the senior advantage. Committing to a decision versus reversing it used to be a one-way door. AI cheapened all three. The skill is no longer weighing every option before choosing. It’s choosing fast, learning fast, and not attaching your identity to the last version of yourself who made the previous choice.

She lands the punch on the third one. Senior people reverse slowly because they’ve spent careers learning that reversing is admitting they were wrong. Young people haven’t learned to attach identity to their decisions yet. Reversal feels like iteration to them. Her closer to young readers: if you can still think clearly without a filter, leave the environment that’s training you not to.

The diagnosis is right. I see it every week. Some of the smartest engineers I know are the slowest to ship because they’re protecting an old version of themselves who knew the answer before AI changed what knowing meant.

But the article mis-attributes the variable. The thing being taxed isn’t experience. It’s identity attachment to past decisions, and that’s a separate axis from age.

The variable isn’t age

Jaya names the right phenomenon, then attaches it to the wrong axis. She points at the senior cohort and says: this group has more reputation to defend, more decisions tied to their identity, so reversal is more expensive for them. All true. But she frames identity attachment as a natural byproduct of having been right in public for a long time.

It isn’t natural. It’s a choice, made compulsory by environment and habit.

Here is the steelman of Jaya’s position, and it survives most of the data. On average, experience correlates with identity attachment because the longer you’ve been right in public, the more reputation you’ve put on the line. Most senior people are paying the tax. That’s true at the population level, and it’s the part of her argument that doesn’t go away.

The piece’s gap is conflating the average with the rule. Some senior operators figured out, somewhere along the way, that holding your previous self lightly is the only way to keep clarity past the age it’s supposed to expire. They aren’t outliers because they got lucky. They built environments around themselves, and habits inside themselves, where reversal stayed cheap. They’re the existence proof that the tax is a default, not a fate.

The 2026 adoption data is also messier than Jaya’s piece suggests. A poll of 4,000 US and UK workers found senior staff adopting AI faster than junior peers, not slower. Top earners have better access to paid tools, more dedicated training time, and more autonomy to experiment. By 2025, 73% of director-level workers had adopted AI, against 65% of individual contributors. Caveat the numbers honestly. “Adopted” is a low bar. The headline gap is single-digit. Among regular AI users only, 21% of leaders report extremely positive productivity impact against 13% of individual contributors, but that’s self-report inside an already-fluent population. None of these prove seniority dominance. What they show is that the binary “old people are slow, young people are fast” doesn’t hold cleanly even at the population level.

The decade of life isn’t doing the work. The decade of identity-protection is.

What the verification asymmetry looks like

There is one finding in the cognitive offloading research that matters here. When AI produces an output, you have to evaluate it. Hold the claims against what you know. Spot the hallucinations. Decide whether to revise or regenerate. That work is cognitively demanding. Experienced professionals catch errors faster because they have deep domain knowledge to compare against. Novices pay a verification tax that sometimes cancels out the efficiency gains.

In software this shows up cleanly. Junior engineers using Claude daily are shipping faster than they were a year ago. Some of the seniors using Claude daily are shipping faster than the juniors and catching a category of issues the juniors didn’t know was a category.

A specific example. Last month I watched a junior engineer prompt Claude to refactor a payment service. The output was clean, the tests passed, and the refactor did exactly what the prompt asked. A senior on the team noticed in about thirty seconds, in code review, that the new version had quietly removed a retry-on-rate-limit pattern that wasn’t covered by the test suite. The model had simplified the wrong thing. The junior’s verification was test-pass. The senior’s was twenty years of “what does this look like in production at 3am.” Both shipped. Only one would have caught it.
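
For readers who haven’t met the pattern, here is a minimal Python sketch of what a retry-on-rate-limit wrapper and the test that pins it down might look like. It is not the code from that payment service: `RateLimitError`, `charge_with_retry`, and `make_flaky_charge` are hypothetical names, and the exponential backoff schedule is an assumption chosen for illustration.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error a payment client raises on an HTTP 429 response."""


def charge_with_retry(charge, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call charge(); on RateLimitError, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return charge()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_charge(failures):
    """Build a fake charge() that is rate-limited `failures` times, then succeeds."""
    calls = {"count": 0}

    def charge():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise RateLimitError("429 Too Many Requests")
        return "charged"

    return charge, calls


def test_retries_through_rate_limit():
    # Two rate-limited attempts, then success on the third.
    charge, calls = make_flaky_charge(failures=2)
    delays = []  # record requested delays instead of actually sleeping
    assert charge_with_retry(charge, sleep=delays.append) == "charged"
    assert calls["count"] == 3
    assert delays == [0.5, 1.0]
```

A test like the last function is the kind of coverage that would turn a silently removed retry into a red build. Without it, a refactor that collapses the loop into a single call still passes every happy-path test.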

This is the asymmetry the binary obscures. AI fluency without a pattern library lets you ship things you can’t fully evaluate. A pattern library without AI fluency keeps you from shipping at all in the new medium.

It isn’t a moat in the durable sense. Tool fluency in 2026 won’t look like tool fluency in 2028, and anyone betting their career on “I’m fluent in Claude” is making the same mistake the CIO made one cycle earlier. What’s durable is the disposition that produced the fluency. The willingness to put hours into something you didn’t grow up with, learn it like a craft, and let your priors get tested by it instead of using them to deflect the test.

What the discipline looks like in practice

It isn’t judgment and taste. That phrase is exactly the defensive crouch Jaya named.

The patterns I see in operators who have un-taxed themselves:

They run their own experiments. They don’t delegate the prompt to a junior and review the output. They paste their own context, read their own raw output, fix their own broken prompts. They’ve put in the same kind of hours they put into mastering a previous craft.

They reverse without ego. The last decision is a hypothesis, not a stake. When the data comes back wrong, they don’t relitigate. They commit again.

They use AI to attack their own priors. Most people use the tool to confirm what they already think. The discipline is asking the tool for the strongest counter-argument to your position, the cleanest version of the case against, the data point that breaks the model. Adversarial use, on purpose.

They’ve stopped leading with “in my experience.” That phrase used to be a gear shift in a meeting. Now they treat it as a flag. They might still have the analogy. They might be right. But they’ve noticed that “in my experience” arriving early in a conversation often shuts the conversation down before anyone has tested whether the analogy actually fits this case.

They pattern-match live. Instead of pulling examples from memory, they pull them from a tool that has more examples and less ego. The senior person who used to win the room with “I’ve seen this before” now does that work in real time, with citations.

I run this loop at BRIDGE IN. Specifically: I open a task, write the spec myself, prompt Claude to find the worst implementation I’d still accept, then prompt it to defend the cleanest one, then argue with it. The thing that’s changed in the last year isn’t that I’m faster. It’s that the version of me who would have shipped the first plausible-looking design is gone. The tool replaced that version of me.

The discipline is identifiable by behavior, not by years on a CV.

The thing being taxed

The thing being taxed isn’t experience. It isn’t age. It’s the version of you that needs to be right.

The CIO isn’t slow because he’s old. He’s slow because he can’t afford to be visibly wrong. He spent thirty years building a self that was correct, and the cost of admitting that self is now operating in a medium it doesn’t understand is higher than the cost of pretending the medium doesn’t matter.

The 22-year-old isn’t fast because she’s young. She’s fast because her identity is in motion. Nothing she’s said publicly has the gravitational mass of three decades of correct calls. She can change her mind cheaply.

That’s the real variable. Not chronological age. Identity weight.

Some senior people have built careers without putting much weight on any single call. They reverse easily because they never made reversal the expensive thing. Some 22-year-olds are already attaching identity to their first big decision. You can see it in the way they double down on a take that didn’t land. The tax shows up in them too, just earlier.

This is reversible. Most people don’t reverse it. That’s the honest version of Jaya’s point. The base rate is real. The ones who do reverse it figured out, somewhere along the way, that holding your previous self lightly is the only way to keep clarity past the age it’s supposed to expire.

The discipline of staying clear

Jaya closes her piece with a message to young readers. If you can still think about a problem without first running every thought through “yes but what would my boss or the world say,” use that ability now. Use it while you have it. The window narrows faster than you think. If you’re in an environment that punishes the clarity you currently have, leave.

That’s true for the toxic environments. It isn’t the only move available, and it isn’t the deepest reading of her own argument.

Clarity isn’t an asset of being young. It’s a discipline.

Some 22-year-olds will lose it inside ten years if they let the next decade teach them that being right is a personality. Some 50-year-olds never lost it because they refused to learn that lesson. They built environments around themselves where reversal stayed cheap. They kept the tool they had at 22 and added thirty years of analogies to it.

The discipline is identifiable by behavior. Who’s running their own experiments. Who’s reversed publicly in the last quarter. Who’s using the tools in their actual workflow and not just their demo. That’s the question, for hiring, for learning, for staying sharp yourself.

Experience is only a tax if you treat it as a fixed asset. As a moving one, it compounds.

AI Will Create More Jobs. Just Not For Everyone.

joaofogoncalves — Thu, 23 Apr 2026 00:00:00 GMT

The paradox has two sides

In January 2025, Satya Nadella posted a link to the Wikipedia article on Jevons paradox. The timing was deliberate. DeepSeek had just released a model that made frontier AI cheaper overnight, and the market was wobbling. Nadella’s one-line commentary: “Jevons paradox strikes again! As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can’t get enough of.”

He was right, at least on the aggregates. A year later, the data is lopsided in his favor. Software-engineering openings at tech companies are up roughly 30% year-to-date in 2026 per TrueUp’s aggregator (tracking ~9,000 tech companies), even as broader tech postings remain well below their 2022 peak. PwC’s 2025 Global AI Jobs Barometer reports that productivity growth in AI-exposed industries has nearly quadrupled, from 7% cumulative over 2018–2022 to 27% cumulative over 2018–2024. Revenue per employee in AI-exposed industries grew roughly three times faster than in least-exposed industries over the same window. The total labor pie is growing, and it’s growing fastest exactly where people predicted it would shrink.

This is the take that crushed on my LinkedIn last month. Cheaper software means more software. More software means more work. Nothing about it is false.

But pointing at the total and calling it good news hides what the totals are actually made of.

Quick detour: what Jevons actually said

In 1865, an economist named William Stanley Jevons noticed something counterintuitive about steam engines. They were getting dramatically more efficient. Each ton of coal produced more work than the year before. Everyone expected coal consumption to drop. Efficiency should save fuel.

It didn’t. Coal consumption went up.

The reason was straightforward once Jevons explained it. Cheaper energy opened uses that weren’t economical before. Factories expanded. New industries started. Railroads extended. The total demand for coal grew faster than efficiency reduced it.

That’s the paradox. When you make something more efficient, you don’t necessarily use less of it. You often use more of it, because cheaper access unlocks demand that was previously priced out.

It’s been applied to electricity, fuel economy, bandwidth, and now AI. The logic is the same every time: when the cost of a thing drops, demand for that thing tends to grow faster than the savings. The pie expands.
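
The mechanism can be sketched with a toy model. The Python snippet below assumes a constant-elasticity demand curve for useful work, which is a textbook simplification and not something Jevons wrote down. The function name, the `scale` constant, and the elasticity values are all illustrative assumptions.

```python
def coal_consumption(efficiency, elasticity, coal_price=1.0, scale=100.0):
    """Tons of coal burned under a constant-elasticity demand for useful work."""
    # A more efficient engine makes each unit of work cheaper.
    cost_per_unit_work = coal_price / efficiency
    # Assumed demand curve: cheaper work means more work demanded.
    work_demanded = scale * cost_per_unit_work ** (-elasticity)
    # Coal needed to deliver that much work at this efficiency.
    return work_demanded / efficiency


for elasticity in (0.5, 1.0, 1.5):
    before = coal_consumption(efficiency=1.0, elasticity=elasticity)
    after = coal_consumption(efficiency=2.0, elasticity=elasticity)
    print(f"elasticity={elasticity}: coal use changes by {after / before - 1:+.0%}")
```

In this model, doubling efficiency cuts coal use by about 29% when elasticity is 0.5, leaves it unchanged at 1.0, and raises it by about 41% at 1.5. The paradox is the third case: total consumption only rises when demand is elastic enough that new uses outrun the savings.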

The part everyone quotes

If you read only the top-line numbers, Jevons looks airtight.

AI-related roles are up across the board, despite two years of tech-sector layoffs. PwC’s 2025 Barometer shows the AI wage premium doubling in a single year, from 25% in 2023 to 56% in 2024, meaning roles requiring AI skills now pay 56% more than otherwise comparable roles that don’t. The same barometer shows AI-exposed industries (not occupations) saw 16.7% wage growth over 2018–2024, revenue per employee growing three times faster than in least-exposed industries, and productivity growth nearly quadrupling as noted above. (One caveat worth naming: PwC classifies industries as “AI-exposed” via a task-composition proxy, not observed AI use. These are strong correlations, not causally identified effects.) Total coal consumption went up when steam engines got efficient, and total demand for software is doing the same thing.

The 2023 panic about AI replacing coders hasn’t played out, at least at the aggregate level. Claude Code ships code. Engineers ship more code. Companies ship more software. The market absorbs more software than it did last year. Nothing about the shape of the industry suggests contraction.

Mr Phil Games wrote a piece on exactly this called Jevons’ Paradox and AI: Why It Means More Developers, Not Fewer. The argument is straightforward. Software was never a fixed pie. The cost of writing it dropped. The volume being written went up. The ceiling on useful software is much higher than we ever acknowledged, because the constraint was always the cost of building, not the demand for it.

This is correct. I’ve said it myself, in almost these exact words.

The trouble is that the same dataset tells a different story if you read it from the bottom up.

The part nobody prices in

The totals are growing. The composition isn’t. Three independent datasets tell the same story. Per Indeed Hiring Lab, entry-level tech postings in February 2025 were off 34% vs February 2020 while senior postings were off only 19%, and the share of tech postings requiring 5+ years of experience rose from 37% to 42% between Q2 2022 and Q2 2025. The Stanford Digital Economy Lab’s ‘Canaries in the Coal Mine’ paper (Brynjolfsson, Chandar & Chen, November 2025 revision, using ADP payroll data through September 2025) finds a 13% relative employment decline for workers aged 22–25 in the most AI-exposed occupations, with 22–25-year-old software developers specifically down roughly 20% from their late-2022 peak. The 35–49 cohort moved the other way: up about 9% over the same window. The Dallas Fed replicated the age-AI-exposure pattern in CPS data in January 2026. Same industry. Opposite directions.

That 56% wage premium for AI skills? It more than doubled in a single year. A 25-point gap became a 56-point gap. Two years before that, the category didn’t really exist. The financial reward for AI fluency isn’t leveling off. It’s accelerating.

PwC’s data, read carefully, says the same thing in a less comfortable way. AI-exposed industries are growing revenue per employee three times faster than the least-exposed ones. That’s a widening gap between the boats that move and the boats that don’t.

The Anthropic Economic Index from March 2026 adds a quieter datapoint. Roughly 49% of O*NET-defined jobs have had at least a quarter of their constituent tasks appear in Claude conversations at least once, with computer and math work the largest category. Anthropic’s separate “effective coverage” measure, which weights by success rate, is materially lower; the 49% is a high-water mark, not a measure of AI actually doing the job. The gap underneath is where the distribution is separating. The Stack Overflow 2025 Developer Survey shows the same gap from a different angle: the share of developers using or planning to use AI tools hit 84% (up from 76% a year earlier), but trust in AI output landed at just 33–54% depending on task type. Adoption is wide. Mastery is narrow.

The pie is growing. The line at the door is getting longer.

Why both things are true

Both readings are correct. They aren’t describing different economies. They’re describing different halves of the same one.

Jevons says: when a thing gets cheaper, more of it gets made. Software gets cheaper, more software gets made, more people get hired to make it. That’s the first-order effect and it’s real.

The second-order effect is the one nobody quotes. When the marginal cost of adequate output drops to near zero, the new demand skips past adequate entirely and lands on work that requires something adequate can’t deliver. Leverage. Judgment. Taste. The ability to orchestrate five agents toward one outcome. The ability to decide what’s worth building in the first place.

The sharper way to put it: the person doing adequate work at adequate speed is in trouble, because adequate is now free.

The middle of the skill distribution isn’t being squeezed by less work. It’s being squeezed by work it can’t do. Jevons creates the demand. Adequate-is-free determines who that demand reaches.

The shape of that squeeze is already visible in team-level telemetry. Faros AI’s 2026 report on roughly 22,000 engineers using AI-assisted tools found per-developer pull requests merged up 98% and epics per developer up 66%, while team-level PR review time jumped 441% and incidents per pull request jumped 243%. The bottleneck moved off typing. It moved onto review, triage, and judgment. Jevons delivered the volume; the volume is what needs to be judged.

Both things are true. They just measure different parts of the same shift.

Who gets the new jobs

There are two doors. Most people who will thrive in the next five years walk through both. Almost everyone who doesn’t will be locked out by one.

The first door is leverage. Can you produce the work of five people with a harness of agents? A senior backend engineer I work with ships three features in the time her team used to ship one, because she runs Claude agents in parallel on the review, the migration, and the documentation pass. Her day-to-day isn’t typing. It’s queuing, checking, correcting. The engineers whose postings are up are the ones building that kind of harness on themselves.

Leverage isn’t free on day one, and it isn’t automatic. The Harvard/BCG/Wharton “Jagged Frontier” study found that consultants using GPT-4 on tasks inside the model’s capability frontier produced roughly 40% higher quality output and finished about 25% faster. The same consultants on tasks just outside that frontier performed measurably worse than a control group without AI. Leverage looks like picking the right tasks to delegate, not delegating everything.

METR’s 2025 study of sixteen experienced open-source developers working on their own repos found something uncomfortable along the same lines: when early-2025 AI coding tools were allowed, those devs took 19% longer to complete real tasks than when they weren’t, with the 95% confidence interval for the slowdown landing between +2% and +39%. They felt 20% faster. They were 19% slower. Brownfield code on a repo you already know is the hardest test for AI, exactly the case where the frontier is narrowest. METR’s own February 2026 update, titled “We are Changing our Developer Productivity Experiment Design,” reports preliminary signs of speedup with newer tools (returning-cohort devs −18%, newly-recruited −4%) but labels those findings “only very weak evidence” and is redesigning the study around severe selection effects. The 19%-slower headline for early-2025 tools still stands. Whether it generalizes to current tools is an open empirical question.

The Stanford AI Index 2026, released earlier this month, lands in roughly the same place: controlled studies show 14–26% productivity gains for software engineering with current-generation AI, but gains turn “smaller or negative for judgment-heavy tasks.” Both readings agree on the shape. The tool gives you leverage where the frontier is broad and costs you time where it isn’t. The leverage door doesn’t open by installing Copilot. It opens after the dip, for the people who pushed through the first painful months when every prompt felt slower than just typing the code.

Most people never clear the dip. The ones who do are the ones getting paid 56% more.

The second door is judgment. When adequate code is free, the bottleneck moves to deciding what to build, what to kill, what good looks like. Product teams that used to draft one spec a week now see ten variants before lunch. The PM’s value is no longer writing the spec. It’s triaging the ten. Most PMs don’t yet have the muscle for that kind of picking. The ones who do are about to get paid for it.

Adopters get both doors. Non-adopters get neither. The people in the middle, doing careful, correct, adequate work at a pace that used to be valuable, are discovering that neither door opens for them anymore.

But isn’t this just reskilling panic?

The skeptical reader’s objection writes itself. Every general-purpose technology produced this same narrative. Electricity was going to end factory labor. The PC was going to kill middle management. The internet was going to kill retail. The labor share didn’t collapse. People upskilled. New roles appeared. The economy absorbed the change.

It’s a fair objection. It may still be right.

But two things make this wave harder to pattern-match to the prior ones.

First, the adequate-is-free dynamic is categorically different. In past waves, a technology made a narrow skill obsolete: a specific typewriting task, a specific filing job. The worker’s broader capability still had a market. AI collapses “producing adequate output at adequate speed” as a general category. That’s a much wider zone of the distribution to reroute.

Second, the pipeline problem. If the first rung of engineering, writing mediocre code under supervision for four years, is the rung vanishing, how does the industry make senior engineers in 2030? The seniors using AI for leverage today didn’t train under AI supervision. The next cohort will. What they turn into is not obvious, and nobody is running the experiment deliberately.

The people betting on historical adaptation aren’t wrong. They’re betting on a pattern that may hold. But prior adaptation windows were measured in decades, not quarters. The best-studied general-purpose-technology parallel, the personal computer, famously took five to fifteen years to show up in aggregate productivity statistics at all. Paul David’s 1990 paper “The Dynamo and the Computer” explained this lag, the Solow productivity paradox, by analogy with the slow diffusion of the electric dynamo, and Brynjolfsson and Hitt’s 2003 follow-up on IT and firm-level productivity showed the full diffusion curve playing out over a decade and a half. AI diffusion is running on a different clock. The Anthropic Economic Index alone recorded task coverage of O*NET jobs jumping from 36% in February 2025 to 49% in February 2026. Thirteen points in twelve months. If adaptation is happening, it’s happening at a pace that leaves most of the workforce behind it.

But doesn’t AI code rot?

The second predictable objection, and a harder one, is about quality. If adequate is free, the argument goes, what actually shows up in the repos isn’t software. It’s a pile of plausible-looking code with a time bomb on it. The 10× output has a 10× maintenance tail. The wage premium is paying for a mess the industry will spend a decade cleaning up.

The numbers behind the critique are not nothing. GitClear’s 2025 research on millions of pull requests found copy-pasted code now exceeds refactored code for the first time since they started measuring, code clones grew roughly 4× in the two years after AI coding assistants went mainstream, and bugs-per-developer rose 54% on high-AI teams. Scientific American, summarizing Berkeley Haas research from early 2026, found AI-assisted developers shipping more tasks per week but also logging longer hours and more out-of-hours fixes. The METR numbers above are part of the same picture: the feeling of speed outruns the measure of speed by a wide margin, and nobody notices until they audit their own calendar.

Name these critiques properly. The first is the technical debt bomb: AI produces more surface area to break. The second is the productivity illusion: we are measurably slower at the exact moment we feel fastest. The third is the senior pipeline collapse: if the junior rung is vanishing (entry-level tech postings down 34% from 2020), where do the seniors of 2032 come from?

Every one of these is real. None of them invalidates the Jevons reading.

The maintenance tail on an adequate-code flood is not a reason to expect fewer jobs. It’s a reason to expect different jobs, concentrated in exactly the slots described above. More code written by agents means more code to review, more conflicting patterns to reconcile, more decisions about what to keep and what to delete, more senior judgment about which auto-generated migration will explode on Monday and which one is safe. The quality debt is the thesis stated from the other side of the ledger. Adequate output is cheap. Editing, triaging, orchestrating, and deciding are what everyone is about to discover they actually needed seniors for in the first place.

The senior pipeline argument is the one worth sitting with the longest. It’s the only one that doesn’t resolve neatly. If the junior rung of the ladder was always write mediocre code under supervision for four years and absorb taste by osmosis, and the junior rung is vanishing, it is genuinely unclear how the next cohort of seniors gets made. The optimistic answer — juniors skip straight to orchestrating agents and pick up taste faster because the iteration loop is tighter — is plausible and being run as a live experiment by approximately zero organizations on purpose. The pessimistic answer — the industry is eating its seed corn and won’t know it until 2030 — is also plausible. Nobody has the data yet.

What we can say: the short-run effect is unambiguously pro-senior, and that effect gets stronger the worse the quality of AI-generated code actually is. The critics and I are looking at the same data. We just draw opposite arrows from it. They say the flood of adequate code is a bug in the Jevons argument. I’d say it’s the engine.

Why this is hard to hear

The obvious response to all this is: fine, tell people to adopt. Run the training. Hand out Copilot licenses. Mandate the tools.

It doesn’t work that way. I wrote about this in an earlier piece, “Pain Gets You In The Door. Curiosity Builds Everything Else.” The research on AI adoption is consistent and uncomfortable. Mandate-driven adoption crowds out intrinsic motivation. The people who adopt because they’re told to don’t build the deep fluency that adopters-by-curiosity do. Luo, Zhou & Cui (2026) in Education and Information Technologies find perceived enjoyment (the intrinsic-motivation variable in a standard UTAUT model) the strongest predictor of generative AI adoption, ahead of performance expectancy, effort expectancy, and facilitating conditions. Lai, Cheung & Chan (2023) in Computers & Education: AI found the same pattern against Technology Acceptance Model variables. Across the literature, intrinsic motivation is among the strongest predictors of creative and frequent use. Ahead of training. Ahead of org support.

The pattern isn’t that mandates never work. IBM’s AskHR reportedly automated “a couple hundred” HR roles while IBM’s net headcount grew elsewhere (Arvind Krishna, WSJ, May 2025). The failure mode is narrower: mandates without curiosity tend to produce shallow users, not the fluent orchestrators leverage rewards.

The workers most exposed to the adequate-is-free shift are also the ones most likely to get the framing that guarantees they won’t close the gap. “Just use AI.” “Upskill.” “Get with the program.” These are the phrases of the adoption pattern that doesn’t stick.

If you’re waiting for your company to pull you through this shift, your company is statistically the worst-positioned actor to do it.

The paradox has a door policy

Jevons is real. AI will create more jobs. That was never the question.

The question was who gets to do them.

The top-line data says the pie is growing. The bottom-line data says the growth is concentrated above adequate. Both halves of the story are true. Nadella was right about Jevons. The adequate-is-free framing is the other half. The engineers shipping 10x with Claude Code and the designers watching their adequate-tier competitors get priced to nothing are living in the same economy, describing the same shift from different sides of the door.

If you’re already through both doors, leverage and judgment, the next five years are the best window you’ll see in your career.

If you’re not through yet, the move isn’t to wait for your company to train you. It’s to build leverage on your own time this week. Pick one task in your job that today takes you four hours. Rewire it so an agent does the first pass and you do the judgment pass. Ship it. Do the same thing next week with a different task.

The door is still open. It’s just getting stricter every quarter.

Sources & further reading

Jevons & aggregate demand

PwC 2025 Global AI Jobs Barometer - wage premium, productivity growth, revenue-per-employee ratios at industry level
PwC press release: AI linked to fourfold productivity growth - the 7% → 27% productivity-growth-rate figure
Anthropic Economic Index, March 2026 - O*NET task coverage across occupations (49% of jobs at ≥25% tasks)
Stanford AI Index 2026 - 14–26% productivity gains for software engineering with current-generation AI
Mr Phil Games, “Jevons’ Paradox and AI” - a clean statement of the optimistic case
Satya Nadella on Jevons, January 2025 - the post that started the cycle

Labor composition

Brynjolfsson, Chandar & Chen, “Canaries in the Coal Mine” (Stanford Digital Economy Lab, Nov 2025 revision) - 13% relative employment decline for workers 22–25 in most AI-exposed occupations (ADP payroll data through Sept 2025)
Indeed Hiring Lab: experience requirements have tightened - 5+ years experience requirement rose 37% → 42% Q2 2022 to Q2 2025
Stack Overflow 2025 Developer Survey (AI section) - 84% adoption, 33–54% trust
TrueUp via metaintro: engineering listings up ~30% - narrow tech-aggregator scope (~9,000 tech companies)

Productivity & quality evidence

METR, July 2025 - RCT on sixteen experienced OSS developers, 19% slowdown, 95% CI +2% to +39%
METR, February 2026 experiment redesign - preliminary speedup labeled “only very weak evidence” with severe selection effects
Faros AI 2026 telemetry report - 22,000 engineers: PRs merged +98%, review time +441%, incidents per PR +243%
GitClear 2025 code quality research - 4× code clone growth, +54% bugs per developer on high-AI teams
Scientific American, March 2026 - Berkeley Haas on longer hours for AI adopters
Dell’Acqua et al., “Navigating the Jagged Frontier” - Harvard/BCG/Wharton GPT-4 consulting study

Historical parallel

Paul David (1990), “The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox,” AEA Papers and Proceedings - the dynamo analogy for the Solow productivity paradox
Brynjolfsson & Hitt (2003), “Computing Productivity: Firm-Level Evidence,” Review of Economics and Statistics - 5–15 year IT productivity diffusion lag

Adoption psychology

Luo, Zhou & Cui (2026), Education and Information Technologies - perceived enjoyment strongest UTAUT predictor of generative AI adoption
Lai, Cheung & Chan (2023), Computers & Education: AI - intrinsic motivation vs TAM variables, same pattern

Your 'AI-First' Engineering Org Probably Isn't

joaofogoncalves — Sat, 18 Apr 2026 00:00:00 GMT

Most of the production code merged at BRIDGE IN this quarter was written by AI. I opened a bug against our onboarding emails yesterday morning. Diagnosis, fix, regression test, PR, CI, merge: done before my second coffee. Three months ago that loop was a sprint.

We didn’t get here by adding AI to our editors. We took the engineering process apart and rebuilt it around agents. We changed how we plan, how we implement, how we review, how we ship. We changed the shape of the team.

Three engineers, one monorepo, roughly fifteen specialized AI agents threaded through every phase of the workflow. I started the team earlier this year, and a couple of months in, I restructured the entire build-and-review flow around the assumption that agents do the keystrokes and people do the judgment.

One note before the rest. BRIDGE IN isn’t public yet. The numbers here are build velocity, not customer-shipping velocity. Whether the product lands with users is a bet still in flight. What the harness has already done is collapse the time between a decision and code that reflects it. At our size and runway, that is what decides whether we launch this year or next.

OpenAI published a concept earlier this year that captured what we’d been doing. They called it harness engineering: the primary job of an engineering team is no longer writing code. It is enabling agents to do useful work. When something fails, the fix is never “try harder.” The fix is: what capability is missing, and how do we make it legible and enforceable for the agent?

We’d been doing that for months before it had a name.

AI-First Is Not the Same as Using AI

Most teams bolt AI onto their existing process. An engineer opens Cursor. A PM drafts a spec with ChatGPT. A tester experiments with AI-generated test cases. The workflow stays the same. Efficiency goes up ten to twenty percent. Nothing structurally changes.

That is AI-assisted.

AI-first means you redesign your process, your architecture, and your team around the assumption that AI is the primary builder. You stop asking “how can AI help our engineers?” and start asking “how do we restructure the work so AI does the building, and engineers provide direction, judgment, and review?”

The difference is multiplicative.

I see teams claim AI-first while running the same sprint cycles, the same planning meetings, the same manual code reviews, the same weekly status calls. They added AI to the loop. They didn’t redesign the loop.

A common version of this is what people call vibe coding. Open the editor, prompt until something works, commit, repeat. That produces prototypes. A production system has to be stable, reliable, secure, maintainable. You need a system that guarantees those properties when AI writes the code. You build the system. The prompts are disposable. The harness is the asset.

Why We Had to Change

When I took over, I watched how the team worked and saw three bottlenecks that would have killed us.

The spec bottleneck. Planning a feature took a week. Writing it took two hours. When build time collapses from weeks to hours, a multi-day planning cycle becomes the constraint. It doesn’t make sense to think about something for a week and then build it before lunch. Product thinking had to move at the speed of iteration or step out of the build cycle.

The review bottleneck. Agents can open a pull request faster than a human can meaningfully read one. If an engineer ships a feature in a morning, hand-review by three people becomes the new wait state. Either you compress review (dangerous) or you automate part of it alongside human judgment (harder, but survivable).

The headcount bottleneck. Competitors in our space run engineering teams many times larger than ours. We couldn’t hire our way to parity. We could, maybe, redesign our way there.

Three things needed to operate at agent speed: design, implementation, and review. If any one of them stayed manual, it would constrain the whole pipeline.

The Bold Decision: One Repo, Legible Everywhere

I fixed the codebase first.

BRIDGE IN’s product lives as a single monorepo: backend, frontend, infrastructure, design system, docs, scripts, agent definitions. A human engineer can touch everything in one session. But the real customer of this shape is the agent.

A fragmented codebase is invisible to an AI agent. A unified one is legible. The more of the system you pull into a form the agent can inspect, validate, and modify, the more leverage you get. One repo. One test matrix. One typed contract between backend and frontend, regenerated from the OpenAPI schema with a single command. One set of CLAUDE.md files that codify, at every level, what “good” looks like.

I spent weeks designing the harness: the agent roster, the skills, the CI gates, the pre-commit hooks, the project board automations, the integrations into our existing observability and chat tools. Then I started asking the agents to rebuild parts of the harness themselves.

BRIDGE IN is building an operations platform. We use our own harness to build the platform that will run the operations.

The Shape of the Harness

The details of the stack matter less than the shape. A few principles hold it together.

One monorepo. Backend, frontend, infrastructure, design system, docs, agent definitions, all in one place. Types regenerate from backend to frontend with a single command, so the contract between them is enforced, not documented.

One CI pipeline. Every PR runs the same gates: format, lint, types, migration checks, tests, coverage floor. No optional phases. No manual overrides. Deterministic, so agents can predict outcomes and reason about failures.

A roster of specialized agents, each with a narrow remit, each constrained by what it is not allowed to do. The architect cannot write code. The project manager cannot implement. Boundaries come from prohibitions, not instructions.

Skills that compose agents into workflows (feature delivery, bug triage, CI repair, dependency PRs) so that an engineer tagging an issue kicks off a sequence of planning, implementation, testing, and review without manual orchestration.

Pre-commit hooks and branch protection as the last line of defense. Nothing lands on master without a human reviewer and CI both agreeing.

The shape is what matters: one repo, one pipeline, specialized agents, composable skills, enforced gates. How an issue moves through that shape (who plans, who implements, what happens when CI fails) is a longer story I’ll tell separately.
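To make "boundaries come from prohibitions" concrete, here is a minimal sketch, not the actual BRIDGE IN agent definitions. The action names (write_code, merge, change_plan), the implementer agent, and the authorize helper are all hypothetical. The point is only the shape: each agent carries a deny list, and anything not on it is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    forbidden: frozenset[str]  # hypothetical action names this agent must never perform

    def may(self, action: str) -> bool:
        # Anything not explicitly prohibited is allowed: the boundary
        # is the deny list, not an allow list of instructions.
        return action not in self.forbidden


# Hypothetical roster; only the architect and project-manager rules come from the text.
ROSTER = {
    "architect": AgentSpec("architect", frozenset({"write_code", "merge"})),
    "project-manager": AgentSpec("project-manager", frozenset({"write_code", "merge"})),
    "implementer": AgentSpec("implementer", frozenset({"merge", "change_plan"})),
}


def authorize(agent: str, action: str) -> None:
    """Raise before the action runs if the agent's remit prohibits it."""
    spec = ROSTER[agent]
    if not spec.may(action):
        raise PermissionError(f"{agent} is not allowed to {action}")


authorize("implementer", "write_code")  # allowed, returns silently
try:
    authorize("architect", "write_code")
except PermissionError as err:
    print(err)  # prints: architect is not allowed to write_code
```

Checking the deny list before the action runs, not after, is what makes the prohibition enforceable and not just advisory.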

The Results

Metric                            Before               After
Code written by AI                Minority             Majority
Time from issue to merged PR      Days to weeks        Hours
PRs merged per week               A handful            20+
Dependency updates processed      Manual, backlogged   Automated, green on merge
Human time spent on CI failures   Hours                Minutes

Over a recent two-week stretch we merged more than a hundred pull requests. A year ago that pace would have been physically impossible.

People assume you trade quality for speed. We didn’t. We ship more tests than we used to. We catch more regressions before they hit production. We have stricter lint, type, and coverage gates than before. The feedback loop is tighter. You learn more when you ship daily than when you ship monthly.

The New Engineering Org

Two kinds of engineers will exist.

The Architect. One or two people. They design the standard operating procedures that teach AI how to work. They build the testing harness, the review skills, the triage flows, the CLAUDE.md files. They define what “good” looks like for the agents.

This role requires deep critical thinking. You criticize the AI. You don’t follow it. When an agent proposes a plan, the architect finds the holes. What failure mode did it miss? What security boundary did it cross? What technical debt is it quietly accumulating?

The ability to criticize AI is more valuable than the ability to produce code. Producing code is the commodity. Critical judgment is the scarce skill.

This is also the hardest role to fill.

The Operator. Everyone else. The work still matters. The structure is different. The triage system finds a bug, surfaces the diagnosis, assigns it to the right person. The person investigates, validates, directs the agent, approves the fix. AI opens the PR. The human reviews whether there is risk. The work is bug investigation, UI refinement, accessibility fixes, PR review, verification. It requires skill and attention. It does not require the architectural reasoning the old model demanded every day.

I haven’t written a line of production Python by hand in weeks. I spend my time building the harness, reviewing what agents produce, and deciding what we build next.

Who Adapts Fastest

I noticed a pattern I didn’t expect. Junior engineers adapted faster than senior engineers.

Junior engineers with less traditional practice felt empowered. They had access to tools that amplified their impact. They didn’t carry a decade of habits to unlearn. They treated the agent as a collaborator they had to manage, not a tool that had to prove itself.

Senior engineers with strong traditional practice had the hardest time. Two months of their best work could collapse into an hour of agent output. That is a hard thing to accept after years of building a rare skill set.

Both things are probably true. Accumulated skill still matters. You cannot criticize what an agent produces without understanding what it produces. But in this transition, adaptability matters more than the skill you accumulated before it started.

The Human Side

I won’t pretend this was smooth.

Management flattened. A few months ago, a third of my time was in alignment meetings. Discussing trade-offs. Debating priorities. Disagreeing about technical decisions. Those conversations are necessary in a traditional model. They are also draining.

Today I still talk to my team. We talk about other things. Design. Product direction. What the agents keep getting wrong. Why a junior engineer shipped three features before the senior finished reading the plan. We get along better because we stopped arguing about work that can be resolved by running a skill.

Uncertainty is real. When I stopped doing line-by-line review every day, some people felt uncertain. What does the lead engineer not reviewing my code mean? What is my value in this new world? Reasonable concerns. I don’t have a clean answer for them. The transition creates anxiety.

The one principle I hold: we don’t fire an engineer because they introduced a production bug. We improve the review process. The same applies to AI. When an agent makes a mistake, we build better validation, clearer constraints, stronger observability. The mistake is a signal about the harness, not a verdict on the agent.

Relationships got better, not worse. Less arguing about trade-offs that the system can resolve. More conversation about what matters.

Beyond Engineering

I see teams adopt AI-first engineering and leave everything else manual.

If engineering ships features in hours but marketing takes a week to announce them, marketing is the bottleneck. If the product team still runs a monthly planning cycle, planning is the bottleneck. If one function operates at agent speed and another at human speed, the human-speed function constrains everything.

Our weekly engineering reviews are AI-generated from repo activity, error data, and team chat. Release notes draft themselves from merged PR titles. Analytics summaries surface the same day the data does. The goal is simple: every function runs on the same kind of harness the engineers run on, or it becomes the new bottleneck.

Engineering was the first domino. It is not the last one.

Three Things That Hold

Three principles have survived every iteration of the harness.

Velocity is capped by the slowest function. Once engineering ships in hours, anything still operating in days becomes the constraint. Speed is a pipeline property, not an engineering one.

Re-engineer, don’t bolt on. Adding AI to your existing process gets you ten or twenty percent. Redesigning the process around AI is multiplicative. The difference is whether you touched the shape of the work.

Adaptability beats accumulated skill. The engineers adapting fastest are not the ones with the deepest traditional practice. They are the ones willing to let the agent do the keystrokes and redirect their judgment elsewhere.

What This Means

For engineers. Your value is moving from code output to decision quality. The ability to write code fast is worth less every month. The ability to evaluate, criticize, and direct is worth more. Product taste matters. Can you look at a generated UI and know it is wrong before the user tells you? Can you look at an architecture proposal and see the failure mode the agent missed? Those skills compound.

For engineering leaders. If your planning cycle takes longer than your build time, that is the first thing to fix. Build the testing and review harness before you scale agents. Fast AI without fast validation is fast-moving technical debt. Start with one architect: one person who builds the system and proves it works. Onboard others into operator roles after the system is running. Push AI-native into every function. Expect resistance.

For the industry. OpenAI, Anthropic, and multiple independent teams have converged on the same principles: structured context, specialized agents, persistent memory, execution loops, hard gates. Harness engineering is becoming a standard. Model capability is the clock driving this. Most of what works at BRIDGE IN today was not possible six months ago. The next generation of models will push it further.

We’re Early

Most engineering leaders I talk to still operate the traditional way. Some are thinking about making the shift. Very few have actually done it.

The tools exist. Nothing in our stack is proprietary. The competitive advantage is the decision to redesign everything around the tools, and the willingness to absorb the cost. The cost is real: uncertainty among engineers, a lead spending more time building systems than managing people, senior engineers questioning their value, a stretch where the old system is gone and the new one is not yet proven.

We absorbed the cost. The pipeline speaks for itself.

We’re building an operations platform. We’re running our own operation like one.

Two Months Later, My AI Team Stopped Being a Pipeline

joaofogoncalves — Thu, 16 Apr 2026 00:00:00 GMT

1. A different Tuesday

In February, I wrote about watching a project manager agent spawn three planners in parallel, wait, sequence a backend pass, then a frontend pass, then optimization, tests, review, docs. An hour of work, no human touching the keyboard, one PR at the end.

That was the story: a pipeline. Feature in, code out.

Two months later, there is no start button anymore.

On a Tuesday in April, my laptop is closed. A Sentry alert fires at 10:17. An autonomous loop picks it up within sixty seconds, reads three Slack channels for context, finds a related thread from last week, creates a GitHub issue linked to both. Ten minutes later, a different skill scans the backlog, claims the top-priority bug, announces the claim in #eng-team with a reply to the original report, and kicks off a build. CI goes green at 11:02. Labels advance. A human hits approve with a thumbs-up reaction on Slack. Merge. Deploy.

I read the whole thing from my phone at lunch.

The pipeline still exists. It’s just one organ of something bigger now.

2. Pipelines deliver. Nervous systems sense and act.

The thesis of the first article was that coordination is the real problem, and a team of specialized agents with gates between them beats a single giant prompt. That’s still true. It’s also insufficient.

A pipeline assumes an input. Someone hands it a spec. It delivers code. When the spec runs out, the pipeline stops. It has no idea what’s happening in Sentry, in Slack, in the backlog, in the last week of merged PRs. It can’t notice things. It can’t follow up. It can’t review its own past work and change its future behavior. It is, in the most literal sense, passive.

A nervous system is always on. It senses. It reacts. It remembers. It adapts.

The shift I didn’t see coming in February is that once you have specialized agents that can deliver a feature, the next hard problem isn’t making the delivery smarter. It’s everything around delivery: deciding what to build, knowing when something breaks, handling follow-ups while five other things are in flight, noticing that the same mistake keeps happening and fixing the mistake-maker instead of the mistake.

3. The three organs

The system grew three distinct subsystems since February. None of them existed in the original article.

Sensors. A skill called /heartbeat runs on a schedule. It polls Sentry for new or escalating errors. It reads four Slack channels: #product, #product-monitoring, #eng-team, #product-development. It handles @mentions in-thread. It follows up on pending threads waiting for clarification or approval. It asks questions when feedback is ambiguous. When it finds something worth acting on, it creates or updates a GitHub issue and links the Sentry event to it. Then it writes everything it learned to a persistent gist so the next run knows what it already handled.

Reflexes. A skill called /pick runs against the backlog. It sorts by priority, claims the top unassigned bug or improvement or research issue, announces the claim in Slack as a reply to whoever reported it, and invokes /build --auto. The build skill then runs the old pipeline: plan, implement, test, review, merge. Same shape as February. Now it runs without a human kicking it off.
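The selection step of a reflex like /pick can be sketched in a few lines. This is an illustration under assumptions, not the real skill: the Issue fields, the "lower number means more urgent" priority convention, the pick-bot name, and the tie-break on issue number are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Issue types the text says /pick will claim.
PICKABLE_KINDS = {"bug", "improvement", "research"}


@dataclass
class Issue:
    number: int
    kind: str
    priority: int  # assumption: lower number = more urgent
    assignee: Optional[str] = None


def pick(backlog: list[Issue], bot: str = "pick-bot") -> Optional[Issue]:
    """Claim the highest-priority unassigned issue of a pickable kind."""
    candidates = [
        issue for issue in backlog
        if issue.assignee is None and issue.kind in PICKABLE_KINDS
    ]
    if not candidates:
        return None
    # Ties on priority go to the oldest issue (lowest number).
    chosen = min(candidates, key=lambda issue: (issue.priority, issue.number))
    chosen.assignee = bot  # the claim; announcing it in Slack is out of scope here
    return chosen
```

Claiming by setting the assignee before starting work is what keeps two runs from picking the same issue, at least in this single-process sketch.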

Memory. Two pieces. The gist state keeps the sensor’s context across runs (last-read timestamps, pending threads, per-user nudge budgets so the bot doesn’t pester the same person twice in an hour). And a skill called /weekly-agent-review that reads the last seven days of agent work, compares it against human corrections on PRs, and proposes updates to the agent definitions themselves.
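A rough sketch of what the persisted sensor state might look like, assuming it is stored as a JSON document. The key names, the one-hour window constant, and the helper functions are assumptions for illustration. Reading and writing the gist itself is left out.

```python
import json
import time
from typing import Optional

# Assumption: "not twice in an hour" means at most one nudge per user per hour.
NUDGE_WINDOW_SECONDS = 3600


def new_state() -> dict:
    # Hypothetical layout for the state carried between runs.
    return {"last_read": {}, "pending_threads": [], "last_nudge": {}}


def may_nudge(state: dict, user: str, now: Optional[float] = None) -> bool:
    """True if this user has not been nudged within the window."""
    now = time.time() if now is None else now
    last = state["last_nudge"].get(user)
    return last is None or now - last >= NUDGE_WINDOW_SECONDS


def record_nudge(state: dict, user: str, now: Optional[float] = None) -> None:
    state["last_nudge"][user] = time.time() if now is None else now


def dump_state(state: dict) -> str:
    # Serialized form that would be written back to the persistent store.
    return json.dumps(state, sort_keys=True)


def load_state(raw: str) -> dict:
    return json.loads(raw)
```

Passing `now` explicitly keeps the budget logic testable without waiting an hour. A scheduled run would leave it unset and use the wall clock.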

The last part is the one that surprised me. The system reviews itself and edits its own source.

4. Specialization is evolution, not a destination

The February roster was twelve agents. I thought it was complete.

It wasn’t. The roster now sits at fourteen, and the additions tell you exactly where the previous version was thin.

ci-fixer showed up in March. The old system trusted the quality-checker to catch everything locally. In practice, CI failed for reasons the local check couldn’t reproduce (flaky tests, migration conflicts, version drift between environments). A human kept jumping in to fix CI. So the agent that reads CI logs, diagnoses the failure, and applies a minimal fix became its own thing.

conflict-resolver showed up the same week. Same pattern. Merge conflicts during long feature branches kept requiring human judgment. Now an agent that understands both sides of a conflict and applies the correct resolution strategy handles most of them.

github-actions-expert is the security specialist. It pins action versions, sets OIDC auth, enforces least-privilege permissions. That work used to be whoever merged the PR. Now it has an owner.

tester is the subtle one. There was already a test-writer, and it planned test coverage too. Planning and writing lived in the same agent, which meant in practice the planning was rushed because the model wanted to get to the writing. Splitting them produced better tests.

The lesson isn’t that twelve was wrong. Twelve was right for what the system was doing in February. The lesson is that you add a specialist every time a generalist keeps producing the same class of error. The roster is a ledger of past failures.
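The "add a specialist when the same class of error keeps recurring" rule can be expressed as a simple tally. This is a hedged sketch: the failure-class labels and the threshold of three are invented for the example, and nothing here reflects how /weekly-agent-review actually works.

```python
from collections import Counter


def specialist_candidates(interventions: list[str], threshold: int = 3) -> list[str]:
    """Return failure classes where a human had to step in at least `threshold` times.

    `interventions` holds one hypothetical class label per human intervention,
    e.g. "ci-failure" or "merge-conflict".
    """
    counts = Counter(interventions)
    return sorted(cls for cls, n in counts.items() if n >= threshold)


week = ["ci-failure", "merge-conflict", "ci-failure", "ci-failure", "typo"]
print(specialist_candidates(week))  # prints: ['ci-failure']
```

The threshold is the judgment call: set it too low and the roster bloats, too high and a human keeps doing a job an agent could own.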

5. Autonomy is a dial, not a switch

The February system had one mode: run the pipeline end-to-end. Reviewing the plan before implementation was optional and manual.

The current /build has three modes, and by default it picks one based on the complexity of the issue:

--auto       No prompts. Auto-merge on green CI. Default for Bug / Improvement.
--guided     Checkpoints before implementation and before merge.
--review     Full checkpoints: plan review, implementation, merge.

The dial matters because the cost of a wrong auto-merge is not symmetric with the cost of a slow merge. A one-line fix to a typo doesn’t need three human checkpoints. A refactor touching thirty files does.

The system reads the issue type, estimates complexity from the spec length and the files it expects to touch, and sets the mode. The human can override. Most of the time, nobody does.

This is the piece that most teams skip. They pick a single autonomy level for everything and then fight about whether it’s too much or too little. The answer is almost always: it depends on the work.
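
The mode selection described above can be sketched roughly as follows. The thresholds, the Issue fields, and the choose_mode name are assumptions made for illustration; the real cutoffs are not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    issue_type: str      # e.g. "Bug", "Improvement", "Feature"
    spec_length: int     # characters in the issue spec
    expected_files: int  # files the plan expects to touch


# Hypothetical thresholds; tune them against your own merge history.
MAX_AUTO_SPEC_CHARS = 1500
MAX_AUTO_FILES = 5
REVIEW_FILES = 20


def choose_mode(issue: Issue, override: Optional[str] = None) -> str:
    """Pick an autonomy mode for /build; a human override always wins."""
    if override is not None:
        return override
    small = (issue.spec_length <= MAX_AUTO_SPEC_CHARS
             and issue.expected_files <= MAX_AUTO_FILES)
    if issue.issue_type in {"Bug", "Improvement"} and small:
        return "--auto"
    if issue.expected_files >= REVIEW_FILES:
        return "--review"
    return "--guided"
```

Under these assumed thresholds, a one-file bug lands on --auto, a thirty-file change lands on --review, and everything in between gets --guided.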

6. What a real day looks like

The pipeline view, in February, was a single feature flowing through stages. The nervous-system view is harder to draw because there is no single flow. There are dozens of partial ones, overlapping, all the time.

Here’s a compressed Tuesday:

08:14  /heartbeat wakes, reads Sentry, spots a new 500 on payroll export
08:14  cross-references #product-monitoring for complaints — finds one from last week
08:15  creates GH issue #2847, links Sentry group, replies to Slack thread
09:02  /pick claims #2847, announces in thread: "on it"
09:04  /build --auto starts. Bug path: skip planning agents, go straight to fix
09:19  backend fix committed, tester writes regression test, both pass locally
09:21  PR opened, CI running
09:33  CI fails on a flaky migration check
09:33  ci-fixer reads the log, identifies a race in the test fixture, patches it
09:41  CI green
09:42  code-reviewer approves, no critical issues
09:43  PR marked ready for review, Slack notification posted
10:11  human taps thumbs-up on the Slack message
10:12  auto-merge fires, deploy pipeline takes over

14:00  /heartbeat wakes again — different mentions, different context
17:30  /pick wakes, picks the next bug, cycle repeats

Friday: /weekly-agent-review reads the week,
        notes that ci-fixer handled 6 of 7 CI failures without human touch,
        notes that the 7th required a human because of a secret rotation,
        proposes adding a "secrets expert" specialist to next week's roster

Three things stand out when you compare this to the February version.

First, no human kicked off #2847. The system created the ticket, assigned it, fixed it, and queued it for review. The human role collapsed to a single thumbs-up reaction at 10:11.

Second, the CI failure at 09:33 would have stopped the February pipeline until a human intervened. Now a specialist handles it.

Third, the Friday step is the one that would have sounded like science fiction two months ago. The system audits its own output and proposes changes to itself. I approve or reject the proposal. The system applies the change. Next week’s version of the system is not the version I wrote.
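
A rough sketch of what the Friday step computes, assuming each CI failure from the week is recorded with a cause and a flag for whether a human had to step in. The CIFailure record and weekly_review function are hypothetical; the actual review reads much richer signals.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CIFailure:
    cause: str          # short label for the failure class
    needed_human: bool  # True if a person had to step in


def weekly_review(failures: list[CIFailure], min_repeats: int = 1) -> dict:
    """Summarize the week and propose specialists for human-handled causes."""
    handled = sum(1 for f in failures if not f.needed_human)
    escalated = Counter(f.cause for f in failures if f.needed_human)
    proposals = [
        f"add a specialist for: {cause}"
        for cause, count in escalated.items()
        if count >= min_repeats
    ]
    # Proposals are suggestions only; a human approves or rejects each one.
    return {"handled": handled, "total": len(failures), "proposals": proposals}
```

Fed six fixture failures that needed no human and one secret-rotation failure that did, it reports 6 of 7 handled and a single proposal. Raising min_repeats makes the review wait for a pattern before proposing anything.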

7. What the nervous system still can’t do

Every honest piece about an agent system needs this section, so here it is.

The weekly-review loop is only as good as the signal it reads. It’s good at catching “the same class of error keeps happening.” It’s bad at catching “this entire approach is wrong.” Strategic drift doesn’t show up in PR diffs. You still need a human who reads broadly and notices that the system is efficiently going in the wrong direction.

The autonomy dial makes smart defaults, not correct ones. A simple-looking bug that turns out to have load-bearing consequences will sail through --auto and cause exactly the outage the old pipeline would have avoided, because there was no plan-review step. The dial reduces friction on the common case at the cost of making the uncommon case more dangerous. Both things are true.

Persistent state via gist is fragile. It works. It’s also a system that stores critical coordination data in a single Markdown file on a third-party service, and I have not yet hit the failure mode where that file goes bad. When that happens, the whole nervous system goes blind for a cycle.

The new specialists (ci-fixer, conflict-resolver) are narrow. They handle the shape of failure they were built for and nothing else. Every new class of failure is a new specialist or a human. There is no general-purpose “something went wrong, figure it out” agent that works at this quality level yet.

And the bill is higher. A nervous system that runs continuously costs more tokens than a pipeline that only runs when you ask it to. The work it does has to be worth the ambient cost of it running at all. For a product team shipping daily, it is. For a side project that ships twice a month, it isn’t.

8. Back to Tuesday

In February, the point of the article was that the gap between “AI writes code” and “AI delivers features” is a systems design problem. The models were capable enough. What was missing was the organizational layer: who does what, in what order, with what inputs, through what gates.

Two months later, there’s a bigger gap on the other side of the pipeline.

The gap between “AI delivers features on request” and “AI runs a product development loop” is also a systems design problem. Same shape. Different scope. What’s missing is the sensing layer, the memory layer, the self-review layer. The organs that make a system aware of itself and its environment.

You don’t need to build all of this to start. Start with one sensor. A skill that watches one channel. A reflex that claims one kind of work. A weekly review that reads the last seven days and tells you what it noticed.

Grow it when you hit a specific failure mode. Same principle as the February article, applied one level up.

The pipeline gets you features. The nervous system gets you a company that ships while you’re at lunch.

The next time you watch an AI agent deliver a feature, ask yourself: who told it to? If the answer is still “a human, every time,” you have a pipeline. That’s fine. It’s just not the end of the road.

Every Company Is Three Things. AI Just Made That Obvious.

joaofogoncalves — Sun, 12 Apr 2026 00:00:00 GMT

Ask any executive what their company does and you’ll get an org chart, a mission statement, maybe a slide deck about culture and values. Ask AI to figure it out and you get something simpler.

AI is stripping companies down to three things. Not three departments. Not three product lines. Three categories of knowledge that determine whether the company actually works.

Expert Knowledge. Tribal Knowledge. And the Hardware and Software that stitches them together.

That’s it. Everything else is scaffolding.

The reason this matters now is that AI doesn’t need the scaffolding. It needs the knowledge. And most companies, when they go looking for it, discover they have far less of it documented than they assumed.

What you know how to do

Expert Knowledge is the vertical-specific understanding of how to operate in your domain. It’s what you’d teach in a textbook if the textbook existed. How to underwrite a loan. How to design a bridge truss. How to diagnose a failing compressor. How to price a derivatives contract.

This is the kind of knowledge that industries build certification programs around. It’s teachable, testable, and increasingly available to AI. If your expert knowledge lives in published standards, regulatory frameworks, or well-documented procedures, a foundation model can learn it. Not perfectly. But well enough to shift the value curve.

McKinsey’s 2025 State of AI report found that 88% of organizations now use AI in at least one function. The knowledge that used to differentiate a senior hire from a junior one is increasingly embedded in the tools both of them use. If it’s in a textbook, it’s in a model. The moat has to come from somewhere else.

AI makes this visible because it’s the first thing companies try to feed it. They point a model at their documented procedures and get decent results fast. Which feels like progress until they realize AI just commoditized the layer they thought was their advantage.

The interesting question isn’t whether AI can access your expert knowledge. It’s whether your expert knowledge is all you have.

What nobody wrote down

Tribal Knowledge is the other kind. It’s the fuzzy grey details that exist in people’s heads, in Slack threads, in the way a senior engineer configures a deployment differently from the documentation, in the reason one sales team consistently outperforms another selling the same product in the same market.

This is the knowledge that is poorly documented because it’s hard to articulate. It lives in intuition built over years of pattern matching. The machinist who can hear when a tolerance is off. The underwriter who knows which data points actually predict default versus the ones the model says should. The project manager who understands which stakeholder needs to be looped in before a decision sticks.

The California Management Review called tacit knowledge your next competitive moat. They’re understating it.

This is where performance dispersion lives. Two companies in the same vertical, same market, same tools, same certifications. One consistently outperforms the other by 20, 30, 40 percent. The spreadsheet can’t explain why. But the people inside both companies know. The difference is that one company has tribal knowledge distributed across its teams. The other lost it when three senior people left and nobody thought to ask them how they actually did the work.

AI exposes this layer by subtraction. Companies deploy a model, feed it their documented knowledge, and the output is competent but generic. The gap between what the AI produces and what the best people produce is tribal knowledge made visible. It’s the delta you can suddenly measure but can’t yet close.

And two converging forces are making this urgent: AI integration is exposing the gaps, and mass retirements are draining the institutional expertise that filled those gaps invisibly for decades. The knowledge walks out the door, and the org doesn’t even know what questions to ask about what it lost.

The glue layer

The third element is Hardware and Software. The CRM, the ERP, the custom internal tools, the spreadsheets nobody admits are running critical processes, the deployment pipelines, the communication platforms. This is where knowledge becomes action.

AI is embedded in this layer now. It’s dramatically better glue. It can stitch Expert and Tribal Knowledge together faster, more consistently, and at a scale that wasn’t possible when the glue was entirely human coordination.

Deloitte’s 2026 State of AI report found that 66% of organizations report productivity gains from enterprise AI adoption. But look at where those gains come from: faster information retrieval, better routing of decisions, more consistent application of known procedures. The knowledge itself didn’t change. The speed at which it could be applied did.

The companies seeing transformational results did something different. McKinsey found that high performers are 3.6 times more likely to pursue transformational change, and 55% fundamentally rework their workflows when deploying AI. They’re not upgrading the glue layer. They’re rebuilding it around their documented knowledge.

That’s a different thing entirely.

What happens when you document both

If Expert Knowledge and Tribal Knowledge are both well documented, something changes. You can rebuild how you do work.

Not optimize. Not automate a few steps. Rebuild.

The progression is straightforward. First you structure the knowledge into forms AI can consume: fine-tuned models, knowledge graphs, retrieval systems. Then you convert it into decision frameworks. Then AI can execute multi-step workflows independently. But the whole chain breaks at step one if the knowledge doesn’t exist in a form that can be structured.

When a company understands both what to do (Expert Knowledge) and how it’s actually done, including the judgment calls, the exceptions that everyone knows but nobody wrote down (Tribal Knowledge), it can redesign its processes from first principles. This is what most AI transformation projects miss. They start with the tools. They buy an AI platform, connect it to their data, and then wonder why the outputs are generic, hallucinated, or irrelevant to how the business actually operates. The issue isn’t the AI. The issue is that nobody documented the knowledge the AI needs to be useful.

The organizations that have done this work are seeing results that look like a different category. The California Management Review cited a case where regulatory evaluations in the cosmetics industry scaled from hundreds to over 40,000 per month with full accuracy and repeatability. Expert workload dropped by roughly 80%. But that only worked because both the expert knowledge (regulatory frameworks) and the tribal knowledge (how experienced evaluators actually interpret edge cases) were codified before the AI was deployed.

Companies with established knowledge layers deployed new AI applications in weeks because more than half of their semantic structure could be reused across use cases.

This isn’t optimization. This is reconstruction. Same company, same people, fundamentally different capability.

More people, not fewer

Here’s the part that most people get wrong.

The assumption is that documenting knowledge and adding AI means you need fewer humans. The math seems obvious. If AI handles the execution, why hire people to do it?

But that’s not what happens in practice. What happens is the bottleneck moves.

When Expert and Tribal Knowledge are well documented and AI handles the execution layer, the constraint shifts to judgment. Not the routine judgment that can be encoded in decision trees. The novel judgment. The edge cases. The strategic calls. The situations nobody anticipated when the knowledge was written down.

Judgment doesn’t scale without people.

A company that has documented its knowledge well can hire faster because onboarding changes. Instead of spending six months absorbing tribal knowledge through osmosis, sitting next to the senior person, learning the unwritten rules, a new hire can get to the judgment layer in weeks. The documented knowledge accelerates them to the point where they can start making meaningful decisions sooner.

This is a growth argument. Companies with well-documented knowledge can absorb more people because each person reaches productive judgment faster. The documented knowledge is the scaffold. The humans provide the judgment. You need more of them because the opportunity space expands when the knowledge bottleneck clears.

Both things are probably true. AI will eliminate some roles where the work is pure execution against well-known procedures. And AI will create demand for more people in organizations where documented knowledge reveals how much novel judgment is actually required to operate well.

The companies that understand this will grow faster. The ones that treat documentation purely as a cost-cutting exercise will find they’ve automated the easy part and hollowed out their capacity for the hard part.

The rebuild

Every company looks complex from the outside. Org charts, processes, culture, technology stacks, decades of accumulated decisions layered on top of each other.

Strip away the scaffolding and there are three things. What you know how to do. What nobody wrote down. And the tools you use to stitch them together.

AI didn’t create these layers. It revealed them. And the companies that understand what they’re actually made of are the ones that can remake themselves.

The rest are still optimizing something they can’t describe.

Pain Gets You In The Door. Curiosity Builds Everything Else

joaofogoncalves — Thu, 26 Mar 2026 00:00:00 GMT

There’s a post circulating where someone describes their CEO checking everything through AI. Website copy. Campaign strategy. Webinar talking points. Every time someone presents work, the response is the same: “Have you asked AI what it thinks?”

The team is frustrated. Demoralized. I get it.

I also think the CEO might be the most rational person in the room.

He found something that gives direct answers. That doesn’t ask for three more days. That doesn’t say “we have experts on the team.” He isn’t obsessed — he’s just less tolerant of slow feedback than he used to be. Both things are probably true. The frustration just runs in opposite directions.

But here’s the part worth thinking about: his relationship with AI will probably stay exactly as shallow as it is right now. Because what drove him there was pain. And pain is a poor architect.

Pain lowers the activation energy

Organizations love the pain frame when it comes to adoption. Make the old way uncomfortable enough, and people will reach for the new one. It follows standard diffusion logic — people change when the cost of not changing exceeds the friction of changing.

And it works. Partially.

Pain gets someone to open the tool instead of dismissing it. Mandate usage metrics, change the approval process, put an impatient CEO in the room. People will start using it.

The numbers back this up. According to McKinsey’s 2025 State of AI report, nearly 90% of organizations now use AI regularly in at least one function. Adoption, by most surface measures, is nearly universal.

But a 2025 study of 10,000 employees by the ifo Institute found something that sits awkwardly next to that headline: while 64% of workers use AI tools, only 20% use them frequently or intensively. Broad diffusion. Shallow use. The same tool, open in millions of browser tabs, barely touched beneath the surface.

McKinsey puts the scaling number at 7%. That’s the share of organizations that have successfully embedded AI across the enterprise — not just piloting it, not just reporting usage, but actually running on it.

The gap between 90% and 7% is the whole article.

What pain actually produces

Pain-driven adoption has a specific texture. The ifo Institute research names it directly: cognitive inertia. When organizational pressure is the primary driver, users rely on AI to complete tasks with “minimally acceptable quality” rather than using it to think differently about the problem. The tool becomes a shortcut, not a collaborator.

The same study found that top-down formal adoption — mandates, training programs, supervision — determines how deeply AI is embedded in daily workflows. But it doesn’t broaden the pool of people actually using it well. That part is driven by something else: informal, employee-led experimentation. People going off-script. Trying things nobody asked them to try.

In other words, pain sets the floor. It does not raise the ceiling.

The CEO checking every presentation through AI isn’t building capability. He’s building a habit. Those are different things. One compounds. The other plateaus.

What curiosity actually produces

A 2026 study published in Frontiers in Psychology identified intrinsic motivation as the strongest single predictor of creative and frequent AI use. More than access. More than training. More than organizational mandate.

And critically: when extrinsic pressure becomes the dominant driver, it can actively weaken the effect of intrinsic motivation on creativity. The mandate crowds out the curiosity. You get a utilitarian atmosphere — people using AI to get things done — where what you actually needed was people using AI to figure out what’s worth doing differently.

This distinction shows up clearly in what separates high performers from the rest. Carnegie Mellon’s AI Maturity research categorizes only 8% of organizations as “reinvention-ready.” What sets them apart isn’t better tools or bigger budgets. It’s a shift from what the research calls horizontal layering — adding AI on top of existing silos — to vertical integration, where teams take end-to-end ownership of AI-augmented workflows and actively redesign how work gets done.

The productivity gap between those two modes is not marginal. It ranges from 8x to 33x reductions in resource consumption. Same technology. Completely different relationship to it.

The uncomfortable part

Here’s where this gets genuinely difficult: you can manufacture pain. Restructure a workflow. Change an approval process. Hire an impatient CEO. Create urgency.

You cannot manufacture curiosity.

A 2026 study confirmed what most practitioners already sense: the psychological pathway to sustained AI adoption runs through motivation internalization. External pressure initiates behavior. But what sustains it — what makes it compound rather than plateau — is curiosity and the habits it builds. Intrinsic motivation and behavioral habit are the core nodes that hold the entire psychological adoption network together.

Which means the real question for any organization serious about going past the 7% isn’t “how do we get people to use this.” It’s: where are the people who are already curious, and are we giving them room to run?

Because those people exist in almost every organization. They’ve already figured out something their team hasn’t. They’re using AI in ways nobody asked them to. They probably haven’t told anyone because there’s no incentive to.

Most adoption programs walk right past them. They’re too busy measuring the 90%.

What this means in practice

Compliance-based adoption is real. It generates the numbers that go in board decks. In a narrow sense, it works.

What it doesn’t produce is the person who comes in on Monday having spent the weekend figuring something out. The junior hire who quietly builds a workflow that changes how a whole department operates. The team member who reframes a problem nobody had thought to reframe.

Those people moved because of curiosity. They didn’t need a mandate. They needed room.

The organizations building real AI capability aren’t the ones with the most aggressive adoption programs. They’re the ones where a handful of genuinely curious people went ahead, figured something out, and made it impossible for everyone else to ignore.

Back to the annoying CEO

He’s in the high-pain, low-curiosity quadrant. Fed up with slow, vague answers — and relieved to have something that just responds. Whether that relief becomes genuine curiosity, or stays a coping mechanism for organizational frustration, is an open question.

His team’s frustration is legitimate. But part of what’s underneath it — unspoken — is that they haven’t found their own reason to go deeper. The CEO found his, even if it’s imperfect.

The gap isn’t really between the CEO and the team. It’s between both of them and the version of this that actually compounds.

Pain is the ignition. Curiosity is the engine.

You need both. But only one of them scales.

Is SaaS Dead?

joaofogoncalves — Mon, 16 Mar 2026 00:00:00 GMT

The math changed

Here’s what happened. AI made coding cheap. Not free, not effortless, but cheap enough that the build vs. buy equation flipped for a lot of teams.

Two years ago, building an internal tool to replace a SaaS product meant hiring developers, managing scope creep, and maintaining something forever. The total cost of ownership was brutal. So you bought. Everyone bought. That’s how we ended up with companies running 200+ SaaS subscriptions and a full-time person just to manage them.

Now a senior engineer with an AI coding agent can prototype that same internal tool in a weekend. The build side of the equation dropped by an order of magnitude. The buy side stayed the same, or got more expensive.

When you change one side of an equation that dramatically, everything downstream moves.

What actually dies

The SaaS products most exposed are the ones that were always a thin layer of logic on top of a database. The ones where the value was never the technology. It was the fact that building it yourself was too expensive to justify.

CRUD apps with nice UIs. Dashboards that rearrange data you already own. Workflow tools that are basically a form connected to a database connected to an email.

These were always a convenience tax. AI just made the convenience optional.

And it goes further than that. It will feel archaic in two years that we used to click through user interfaces to navigate databases and complete tasks. Agents just do it. One prompt. Done. 90% of the entire application layer is going to get eaten over the next decade. The dashboards. The forms. The CRUD. All of it.

What doesn’t die

Some SaaS products have moats that AI doesn’t erode. It actually makes them more valuable.

Regulated environments. If your software needs certification, if it operates in a space where compliance isn’t optional, if regulators need to audit your processes, you can’t vibe-code your way through it. A weekend prototype doesn’t pass regulatory review. The certification itself is the product. It takes years to build, and your customers can’t replicate it no matter how cheap their engineers are.

Deep dependency. Some tools wove themselves so deeply into how teams operate that ripping them out is more expensive than any alternative. This isn’t lock-in through malice. It’s lock-in through genuine usefulness. When a product becomes the operating system of a team’s daily work, the switching cost has nothing to do with code.

Humans in the loop. There are domains where the service requires human judgment. Regulatory requirements that demand a qualified person sign off. Subjective decisions where the accumulated knowledge of specialists still matters. AI can assist here, but it can’t replace the human, and in many cases the law doesn’t allow it to. The SaaS layer that orchestrates this human expertise? That’s not going anywhere.

Domain knowledge that’s still out of reach. Some industries have accumulated wisdom that isn’t in any training set. The edge cases, the unwritten rules, the things you only learn by operating in a space for years. Products built on this knowledge aren’t competing with AI. They’re competing with decades of experience, and AI doesn’t have it yet.

From mega to micro

The era of mega-SaaS, the all-in-one platform that tries to be everything for everyone, is the part under real pressure. These products survived because the pain of integrating five smaller tools used to outweigh the pain of paying for features you don’t use. That integration pain is getting cheaper too.

What’s emerging is micro-SaaS. Smaller, sharper products that do one thing well for a specific audience. The economics work because building them is cheaper. The market works because teams can now afford to be picky. Instead of buying the whole suite and using 20% of it, you find the tool that fits your exact problem.

More products competing on actual fit instead of feature count. More teams paying for what they use instead of what they might use someday. That’s healthier.

The inversion

Here’s where it gets interesting. Look at those moats again. Regulated environments. Human expertise. Domain knowledge. What do they all have in common?

The value is the service. The software is the delivery mechanism.

We’ve spent two decades packaging services as software. The whole SaaS model was: take a process that used to require people, automate it, sell the automation. The direction was always service into software.

That direction is inverting.

The real opportunity for most companies isn’t building another SaaS product. It’s taking genuine domain expertise and baking it into AI-powered delivery. Service-as-a-Software.

An ad agency that encodes its winning playbooks into AI systems and serves 1,000 clients with the quality it used to give 10. An IP law firm that packages decades of expertise into AI skill files and delivers legal services at near-zero marginal cost. An HR and payroll company that operates across European regulatory environments, where the humans in the loop aren’t a limitation but the entire product.

The backend is AI. The frontend is your expertise packaged as a service. The moat is that you actually know what good looks like in your domain.

You’re not competing with OpenAI. You’re competing with other service providers who are still doing everything manually. That’s not a hard fight to win.

So is SaaS dead?

SaaS isn’t dead. It’s inverting.

The version that existed because building was too hard, that charged enterprise prices for commodity logic, that bundled everything to justify the price tag? That part is dying.

What replaces it is sharper. Micro-SaaS tools that earn their place by fitting perfectly. And underneath those, a new layer: companies that stopped trying to sell software and started selling their expertise through it.

The technology gets commoditized. The person who knows how to use it doesn’t.

AI Can Write Code. Coordinating AI Agents Is the Actual Hard Part.

joaofogoncalves — Tue, 17 Feb 2026 00:00:00 GMT

1. The Scene

I watched it happen on a Tuesday afternoon. A project manager agent read a feature spec, spawned three planning agents in parallel, and waited for all three to finish. A product manager reviewed their output and sent feedback. Then the system shifted gears: backend implementation, frontend picking up the generated types, optimization, simplification, tests, a quality checker running linting and type checks across the entire codebase, a code review, documentation updates. The system updated the draft PR to ready-for-review, and that was it.

No human keyboard input for over an hour, across 30+ files touching both backend and frontend. And the code was production-quality.

But getting here took six weeks of rebuilding everything I thought I knew about using AI for development.

Here’s the thesis: most teams are optimizing single-agent prompting. The real productivity leap comes from solving orchestration: how multiple specialized agents coordinate, communicate, and quality-gate their own work. This is a systems design problem, and the principles for solving it transfer better than any specific prompt.

I built this system while developing a full-stack SaaS platform: Django on the backend, React on the frontend, the kind of multi-module application where features span models, schemas, endpoints, components, pages, and tests. What follows is the system, how it evolved, what worked, what didn’t, and the principles that transfer regardless of stack or tooling.

2. The Problem: Why Single-Agent AI Hits a Ceiling

The Copilot Plateau

AI-assisted coding has a productivity curve that flatlines faster than most teams expect. The first phase is magical: you ask for a function, it writes a function. You describe a bug, it suggests a fix. You paste an error, it explains it. For isolated tasks (write a utility, add a validation rule, refactor a method), the current generation of AI models is excellent.

The plateau hits when you move from tasks to features.

A feature isn’t a function. It’s a coordinated change across data models, API endpoints, frontend components, routing, state management, tests, and documentation. AI handles a CRUD endpoint beautifully. Ask it to build a cross-module feature, one that touches models, schemas, API contracts, components, and tests in coordinated sequence, and it starts losing coherence. What slips is the coordination: planning before implementation, sequencing between layers, verification after everything is wired together. A single agent holding all of that context doesn’t get slower. It gets confused.

Engineering leaders have felt this gap intuitively. AI helped write a function, great. But who planned which function to write? Who verified it integrates with the existing system? Who checked it didn’t break something three modules away?

The Context Window Trap

A single AI session trying to hold a feature spec, the existing architecture, backend implementation decisions, frontend component patterns, test coverage requirements, and code review criteria simultaneously will lose coherence. This isn’t a model intelligence problem. It’s an information density problem. The more context you load, the less reliably the model attends to all of it.

You’ve experienced this as “the AI forgot what I told it earlier.” At feature scale, it’s more like “the AI is simultaneously trying to be an architect, a developer, a tester, and a reviewer, and doing none of them well.”

The Missing Middle

There’s a gap between “AI writes a function” and “AI delivers a feature,” and it’s not bridged by better models or longer context windows. It’s bridged by answering a set of questions that have nothing to do with the model’s capabilities:

Who plans what to build?
Who builds it, and in what order?
What inputs does each step need?
What gates must pass before the next step begins?
Who verifies the result?

These are orchestration questions. They’re the same questions you’d answer when structuring a human engineering team. The missing middle isn’t intelligence. It’s organization.

3. The Insight: AI Agents Need What Human Teams Need

The breakthrough that changed my approach was obvious in hindsight: AI agents need the same organizational structures as human engineering teams.

A single brilliant engineer cannot replace a well-organized team. They can write excellent code, but they can’t simultaneously hold the product vision, the system architecture, the implementation details, the test strategy, and the documentation plan with equal depth. We solved this in human organizations decades ago with specialization, workflow, and quality gates.

Single-agent AI has the same limitation. It’s not that the model can’t reason about architecture or testing or code review. It’s that asking one agent to do all of these things in a single session produces the same result as asking one engineer to be the entire team: technically possible, practically mediocre.

What do real engineering teams have that single-agent AI lacks? Specialization, for one. An architect thinks differently from a developer, who thinks differently from a QA engineer. Distinct roles produce distinct perspectives. Teams also have workflow: you plan before you build, build before you test, test before you ship. They have communication through artifacts. The architect produces a blueprint. The developer reads the blueprint. The tester reads the spec. They have quality gates, where work doesn’t advance until it meets criteria. And they have state tracking, so everyone knows what phase the project is in.

The design principle that emerged: instead of asking one agent to do everything, build a team of specialists with clear interfaces between them. Define what each agent does, what it receives as input, what it produces as output, and what conditions must hold before the next agent begins.
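As a rough illustration, that interface can be written down as data. This is a sketch in Python, not how the system itself is built (the agents are markdown definitions). The `AgentSpec` class and its field names are assumptions made up for this example; only the artifact names come from the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one specialist agent's interface."""
    name: str
    role: str                       # what the agent does
    inputs: tuple[str, ...]         # artifacts it reads
    output: str                     # the single artifact it writes
    must_not: tuple[str, ...] = ()  # explicit prohibitions

    def ready(self, available: set[str]) -> bool:
        """Precondition: every input artifact must already exist."""
        return all(item in available for item in self.inputs)


architect = AgentSpec(
    name="architect",
    role="Design data models and API operations",
    inputs=("pm-spec.md",),
    output="system-plan.md",
    must_not=("executable code",),
)

backend = AgentSpec(
    name="backend-developer",
    role="Implement the backend described in the plan",
    inputs=("pm-spec.md", "system-plan.md"),
    output="backend-implementation.md",
)

available = {"pm-spec.md"}
print(architect.ready(available))  # True
print(backend.ready(available))    # False: system-plan.md does not exist yet
available.add(architect.output)
print(backend.ready(available))    # True
```

The point of the `ready` check is that "what conditions must hold before the next agent begins" becomes something you can evaluate, not something the orchestrator has to remember.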

4. The System: A 12-Agent Orchestrated Workflow

The implementation isn’t a custom Python framework or a wrapper around an API. Each agent is a markdown file, a Claude Code custom agent definition that specifies the agent’s role, constraints, inputs, outputs, and model. The project manager agent orchestrates the workflow by spawning these agents as sub-agents in the prescribed sequence. The entire system is configuration, not code. The tooling here is Claude Code, but the pattern (specialized agents with defined interfaces) applies to any agent framework.

The Agent Roster

The key insight: constraints define roles more than capabilities. The architect’s definition doesn’t just say “you are an architect.” It explicitly prohibits code generation: no Python classes, no TypeScript interfaces, no function signatures, no import statements, no executable code of any kind. The output must be field tables, relationship diagrams, and API operation descriptions. The boundary between “what to build” and “how to build it” is enforced by prohibition, not suggestion.

Similarly, the project manager is told it never implements code directly. It plans, delegates, tracks, and coordinates. Without this hard constraint, the agent reliably drifts into writing code itself instead of delegating to specialists.

The 4-Phase Workflow

The system operates in four phases (plus initialization), each with explicit entry conditions, agent assignments, and gates:

Phase 0: Initialization
├── Create branch, draft PR, state tracker on GitHub Issue
└── Gate: infrastructure ready

Phase 1: Planning (PARALLEL)
├── UI Designer ──────→ design-plan.md ─┐
├── Architect ────────→ system-plan.md  ├→ Product Manager reviews all three
├── Test Writer ──────→ test-plan.md   ─┘
├── Agents revise based on PM feedback
└── Gate: all plans complete and reviewed

Phase 2: Implementation (SEQUENTIAL)
├── Backend Developer → commit + push
├── Frontend Developer → commit + push
├── Code Optimizer → commit + push
├── Code Simplifier → commit + push
└── Gate: each step committed before next begins

Phase 3: QA & Documentation (SEQUENTIAL)
├── Test Writer (implements tests) → commit + push
├── Quality Checker (ENTIRE codebase) → commit + push
├── Code Reviewer → review report
├── Documentation Expert → commit + push
└── Gate: all checks pass system-wide

Phase 4: Ready for Review
└── Draft PR → Ready for Review. Human enters the loop.

The parallel vs. sequential split is intentional. Planning is embarrassingly parallel. Three agents creating independent perspectives on the same spec don’t need to coordinate. The architect doesn’t need the UI designer’s output, and vice versa. Implementation has to be sequential: the frontend developer needs the backend’s API types, the optimizer needs the code to exist before optimizing it, and the quality checker needs everything in place before running system-wide checks.
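The shape of that decision can be sketched in a few lines of Python. This is an illustration under assumptions, not the real orchestration (which is a project manager agent following markdown instructions): `run_agent` is a hypothetical stand-in for spawning a sub-agent, and threads stand in for parallel sub-agents.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(name: str, inputs: dict[str, str]) -> str:
    """Hypothetical stand-in for spawning a sub-agent; returns its artifact text."""
    return f"{name} read {sorted(inputs)}"


def plan(spec: str) -> dict[str, str]:
    """Phase 1: three planners work from the same spec, independently."""
    planners = {
        "ui-designer": "design-plan.md",
        "architect": "system-plan.md",
        "test-writer": "test-plan.md",
    }
    with ThreadPoolExecutor() as pool:
        futures = {
            artifact: pool.submit(run_agent, agent, {"pm-spec.md": spec})
            for agent, artifact in planners.items()
        }
        return {artifact: future.result() for artifact, future in futures.items()}


def implement(artifacts: dict[str, str]) -> list[str]:
    """Phase 2: each step sees everything produced before it, in a fixed order."""
    steps = [
        ("backend-developer", "backend-implementation.md"),
        ("frontend-developer", "frontend-implementation.md"),
        ("code-optimizer", "optimization-implementation.md"),
        ("code-simplifier", "simplification-implementation.md"),
    ]
    completed = []
    for agent, artifact in steps:
        artifacts[artifact] = run_agent(agent, dict(artifacts))
        completed.append(agent)
    return completed


artifacts = {"pm-spec.md": "feature spec text"}
artifacts.update(plan(artifacts["pm-spec.md"]))
order = implement(artifacts)
print(order)
```

In this sketch the frontend step receives the backend artifact as input, while the backend step never sees the frontend one. That dependency is what rules out running the two in parallel.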

Here’s what the phase structure looks like in the actual project manager definition:

| Phase | Phase Label | Project Status |
|---|---|---|
| 0 | phase:0-init | Backlog |
| 1 | phase:1-planning | Planning |
| 2 | phase:2-implementation | Implementation |
| 3 | phase:3-qa-docs | QA & Documentation |
| 4 | phase:4-ready-for-review | Ready for Review |

Each phase transition updates labels on the GitHub Issue, moves the item on the project board, and updates a state tracker comment via the GitHub API.
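To make the transition concrete, here is a small Python sketch of the phase state machine. The labels and statuses are the ones from the project manager definition; the `Phase` enum, `advance`, and `transition_updates` are hypothetical names for this example, and the actual GitHub API calls are left out.

```python
from enum import IntEnum


class Phase(IntEnum):
    INIT = 0
    PLANNING = 1
    IMPLEMENTATION = 2
    QA_DOCS = 3
    READY_FOR_REVIEW = 4


# (issue label, project board status) per phase
LABELS = {
    Phase.INIT: ("phase:0-init", "Backlog"),
    Phase.PLANNING: ("phase:1-planning", "Planning"),
    Phase.IMPLEMENTATION: ("phase:2-implementation", "Implementation"),
    Phase.QA_DOCS: ("phase:3-qa-docs", "QA & Documentation"),
    Phase.READY_FOR_REVIEW: ("phase:4-ready-for-review", "Ready for Review"),
}


def advance(current: Phase, gate_passed: bool) -> Phase:
    """Move to the next phase, but only if the current phase's gate passed."""
    if not gate_passed:
        raise RuntimeError(f"gate for {current.name} has not passed")
    if current is Phase.READY_FOR_REVIEW:
        raise ValueError("already at the final phase")
    return Phase(current + 1)


def transition_updates(current: Phase, nxt: Phase) -> dict[str, str]:
    """Describe what has to change on the Issue and the project board."""
    old_label, _ = LABELS[current]
    new_label, status = LABELS[nxt]
    return {"remove_label": old_label, "add_label": new_label, "project_status": status}


nxt = advance(Phase.PLANNING, gate_passed=True)
print(transition_updates(Phase.PLANNING, nxt))
```

Keeping the transition as a pure description of changes makes it easy to check before anything is sent to GitHub.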

The Communication Layer

Agents don’t talk to each other directly. They communicate through named artifacts stored in defined locations: planning artifacts on the GitHub Issue, implementation artifacts on the Pull Request.

Here’s the actual artifact mapping from the project manager’s definition:

| Source | File | Location |
|---|---|---|
| Feature spec | pm-spec.md | Issue |
| Execution state tracker | project-state.md | Issue |
| @product-manager output | pm-plan.md | Issue |
| @ui-designer output | design-plan.md | Issue |
| @architect output | system-plan.md | Issue |
| @test-writer output | test-plan.md | Issue |
| @backend-developer output | backend-implementation.md | Pull Request |
| @frontend-developer output | frontend-implementation.md | Pull Request |
| @code-optimizer output | optimization-implementation.md | Pull Request |
| @code-simplifier output | simplification-implementation.md | Pull Request |
| @test-writer output | tests-results.md | Pull Request |
| @quality-checker output | quality-checks.md | Pull Request |
| @code-reviewer output | code-review-writing.md | Pull Request |
| @documentation-expert output | documentation-writing.md | Pull Request |

Each agent reads specific inputs and writes a specific output. The project manager orchestrates sequencing; communication is asynchronous and artifact-based. This makes the entire system debuggable. You can inspect any artifact to understand exactly what an agent received as input and what it produced.
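The debuggability claim is easier to see with a toy model. The sketch below uses an in-memory `ArtifactStore` as a stand-in for Issue and PR comments; the class and its methods are assumptions for illustration only. It records every read and write so you can later ask what a given agent actually saw.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """In-memory stand-in for Issue and PR comments (hypothetical)."""
    artifacts: dict[str, str] = field(default_factory=dict)
    log: list[tuple[str, str, str]] = field(default_factory=list)  # (agent, action, artifact)

    def write(self, agent: str, name: str, body: str) -> None:
        self.artifacts[name] = body
        self.log.append((agent, "write", name))

    def read(self, agent: str, name: str) -> str:
        body = self.artifacts[name]  # KeyError if upstream never produced it
        self.log.append((agent, "read", name))
        return body

    def seen_by(self, agent: str) -> list[str]:
        """Which artifacts did this agent read, in order?"""
        return [name for who, action, name in self.log if who == agent and action == "read"]


store = ArtifactStore()
store.write("human", "pm-spec.md", "feature spec")
store.read("architect", "pm-spec.md")
store.write("architect", "system-plan.md", "field tables and API operations")
store.read("backend-developer", "pm-spec.md")
store.read("backend-developer", "system-plan.md")
print(store.seen_by("backend-developer"))  # ['pm-spec.md', 'system-plan.md']
```

With conversational context there is no equivalent of `seen_by`. With named artifacts, the inputs to any step can be reconstructed after the fact.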

5. Five Hard-Won Principles

5a. Constraints Over Capabilities

The most important word in every agent definition is “NOT.”

The architect’s definition doesn’t just describe what an architect does. It explicitly enumerates what the architect must never produce: no Python class definitions, no Pydantic schema code, no TypeScript interfaces, no function signatures, no import statements, no executable code of any kind. The output format is prescribed as field tables, relationship diagrams, and API operation descriptions. Never implementation.

The project manager’s core rule: “Never implements code directly — plans, delegates, tracks, coordinates.”

Without these hard constraints, agents drift into each other’s territory. The architect starts writing code. The project manager starts implementing instead of delegating. The boundaries blur and you end up back where you started.

Define what an agent cannot do more carefully than what it can do.
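In the system described here, the prohibition lives in the agent definition itself. If you wanted a mechanical backstop on top of that, one option is a simple check on the architect’s output. The following Python sketch is an assumption, not part of the system: the patterns are crude heuristics chosen for this example and will miss plenty of cases.

```python
import re

# Heuristic patterns for things the architect must never emit (illustrative only).
FORBIDDEN = {
    "python import": re.compile(r"^\s*(import\s+[\w.]+|from\s+[\w.]+\s+import\b)", re.M),
    "python definition": re.compile(r"^\s*(def\s+\w+\s*\(|class\s+\w+\s*[:(])", re.M),
    "typescript interface": re.compile(r"^\s*(export\s+)?interface\s+\w+\s*\{", re.M),
}


def find_violations(plan_text: str) -> list[str]:
    """Return the labels of every forbidden pattern found in a plan."""
    return [label for label, pattern in FORBIDDEN.items() if pattern.search(plan_text)]


good_plan = "| Field | Type |\n|---|---|\n| id | UUID |\nOrders belong to one customer."
bad_plan = "from pydantic import BaseModel\n\nclass Order(BaseModel):\n    id: str\n"

print(find_violations(good_plan))  # []
print(find_violations(bad_plan))   # ['python import', 'python definition']
```

A check like this doesn’t replace the written constraint. It only makes drift visible when the constraint fails.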

5b. Gates Over Trust

Every phase transition has an explicit gate. Planning doesn’t advance to implementation until all three plans are complete and reviewed. Each implementation step must be committed before the next begins.

But the most important gate is in Phase 3. Here’s what the quality checker’s definition mandates:

## Critical Rules
1. **ENTIRE codebase** – Run on ALL files, not just changed files
2. **NEVER use `--select`** or file-specific flags during checks
3. **NEVER skip a check** – All 6 checks must pass
4. **NEVER delete or skip tests** to make the suite pass
5. **Fix pre-existing issues** – If the codebase has pre-existing failures, fix them
6. **No partial success** – Either ALL checks pass or the task is not done

Six checks (linting, formatting, type checking, and tests for both backend and frontend) must pass across the entire codebase, not just the files changed in this feature. A new feature cannot silently break existing functionality.

Without gates, agents optimistically advance and quality degrades silently. They’ll report “looks good” when it isn’t, because the default behavior of a language model is to be helpful and move forward, not to block progress.

Don’t trust agents to self-assess quality. Build external verification into the workflow.
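A gate of this kind is simple to express. Here is a minimal Python sketch of the "no partial success" rule. The check names are placeholders, and the lambdas stand in for real commands; in practice each check would shell out to a linter, type checker, or test runner.

```python
from typing import Callable

Check = Callable[[], bool]


def run_gate(checks: dict[str, Check]) -> tuple[bool, list[str]]:
    """Run every check and pass only if all of them pass."""
    failures: list[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure, never as a skip
        if not ok:
            failures.append(name)
    return (not failures, failures)


# Placeholder checks for illustration.
checks = {
    "backend-lint": lambda: True,
    "backend-tests": lambda: True,
    "frontend-typecheck": lambda: False,
}
passed, failures = run_gate(checks)
print(passed, failures)  # False ['frontend-typecheck']
```

Note that the gate runs every check even after the first failure, so the report lists everything that needs fixing, and that the result comes from the checks themselves, not from an agent’s summary of them.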

5c. Artifacts Over Conversation

Agents communicate through named artifacts: design-plan.md, system-plan.md, backend-implementation.md. Each agent has defined inputs and outputs. The architect reads pm-spec.md and produces system-plan.md. The backend developer reads pm-spec.md and system-plan.md and produces backend-implementation.md.

This matters because of debuggability. When something goes wrong (and it will), you can inspect any artifact to understand what an agent saw, what it produced, and where the chain broke down. Conversational context is ephemeral. Artifacts persist, and you can version them and inspect them after the fact.

Treat agent outputs as contracts. Name them, store them, make them inspectable.

5d. Scoped Changes, Global Verification

The code optimizer and code simplifier are scoped to “only files modified in this feature.” They don’t touch the rest of the codebase. But the quality checker runs on the entire codebase. Every lint rule, every type annotation, every test, across all modules.

This is an intentional asymmetry. Agents that change things should be scoped tightly. You don’t want an optimizer “improving” code in unrelated modules. But agents that verify things should run broadly, because you need to know that your changes didn’t break something elsewhere.

Scope changes narrowly. Verify broadly.
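The asymmetry can be stated as two small functions. This Python sketch uses made-up file paths and hypothetical function names; it only illustrates the rule, not how the agents enforce it.

```python
def out_of_scope_changes(changed: set[str], feature_files: set[str]) -> list[str]:
    """Files an editing agent touched outside the feature's own file set."""
    return sorted(changed - feature_files)


def verification_targets(all_files: set[str]) -> list[str]:
    """Verification ignores the feature boundary and covers everything."""
    return sorted(all_files)


# Made-up paths for illustration.
feature_files = {"api/orders.py", "web/OrderForm.tsx"}
all_files = feature_files | {"api/users.py", "web/Login.tsx"}
optimizer_changes = {"api/orders.py", "api/users.py"}

print(out_of_scope_changes(optimizer_changes, feature_files))  # ['api/users.py']
print(len(verification_targets(all_files)))                    # 4
```

Anything returned by `out_of_scope_changes` is a reason to reject the optimizer’s commit, while the verification set is never narrowed to the feature.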

5e. Model Selection Is a Lever

Not every agent needs the most powerful model. The two Claude tiers used here are Opus (most capable, slower, higher cost) and Sonnet (faster, cheaper, still strong for implementation tasks). Here’s the actual assignment from the system:

## Architecture & Documentation — most capable model
- **architect:**            `opus`
- **documentation-expert:** `opus`

## Everything else — faster model
- **project-manager:**      `sonnet`
- **product-manager:**      `sonnet`
- **backend-developer:**    `sonnet`
- **frontend-developer:**   `sonnet`
- **test-writer:**          `sonnet`
- **code-reviewer:**        `sonnet`
- **code-optimizer:**       `sonnet`
- **code-simplifier:**      `sonnet`
- **quality-checker:**      `sonnet`
- **ui-designer:**          `sonnet`

The architect uses the most capable model because its job is structural reasoning about system boundaries and data flows. The project manager uses a faster model because it follows a defined workflow and updates state. That’s coordination, not cognition.

Match model capability to task complexity. Over-powering coordination agents wastes tokens and adds latency.

6. The Evolution: How the System Grew

Week 1: The Starting Six

The system began with six agents: a feature orchestrator, a backend implementer, a frontend implementer, a code optimizer, a code reviewer, and a test writer. The orchestrator tried to do too much: planning, coordinating, and making product decisions all at once.

The names reflected implementation thinking. “Backend implementer” sounds like a function call. “Feature orchestrator” sounds like middleware. These weren’t just labels. They shaped how the agents behaved.

Week 2: The Communication Crisis

The original system stored state in files on disk. Planning artifacts lived in a local directory. State tracking was a markdown file that agents read and updated. This worked for a single feature in isolation. It broke the moment anything went wrong: a failed step, a context window limit, a session restart.

The turning point was realizing that GitHub Issues and Pull Requests are already a communication layer designed for multi-person coordination. They have comments, labels, project boards, state tracking, and an API for programmatic updates.

I shifted all artifacts to Issue and PR comments. State tracking moved to a single updatable comment via the GitHub API. Planning artifacts were posted to the Issue, implementation artifacts to the PR.

This was the single biggest improvement in the system’s reliability. Not because GitHub is a better file system, but because it’s a communication layer that already handles persistence, visibility, linking artifacts to context, and surviving session restarts.

Week 3: The Specialization Expansion

The system grew from 6 to 12 agents. The most important additions were the architect and the product manager.

The architect was the breakthrough. Before it existed, the backend developer was simultaneously deciding what to build and how to build it. Separating architectural decisions from implementation eliminated an entire class of coherence problems. The architect reasons about data models, API contracts, and system boundaries. The developer reads that blueprint and implements it. Each does one thing well.

The product manager created a similar separation: what the user needs vs. what the system needs. The PM reviews all plans through a user-centric lens. Without it, technical elegance sometimes won over user value.

Week 4: The Naming Revelation

I renamed “feature-orchestrator” to “project-manager” and “backend-implementer” to “backend-developer.”

This sounds trivial. It wasn’t.

Names shape behavior. “Implementer” tries to implement. It’s a verb, an action, a narrow mandate. “Developer” thinks about development more holistically. The same agent definition with a name change produced noticeably different outputs. The “backend-developer” more naturally considered edge cases and integration points that the “backend-implementer” had ignored.

Similarly, “feature-orchestrator” sounded like infrastructure. “Project-manager” sounded like a role with judgment and responsibility. The agent started making better sequencing decisions.

Weeks 5-6: Quality Layers and Hardening

The final evolution added the code simplifier and the quality checker, and introduced system-wide quality checks, where verification must run on the entire codebase, not just changed files. Early versions also tried to parallelize implementation (backend and frontend simultaneously), which failed because the frontend depends on backend types. Understanding which steps are truly independent took weeks of trial and error. By week six, features started shipping end-to-end.

Key lesson: You cannot design a multi-agent system from scratch. You grow it. Start with minimum viable orchestration and add complexity only when you hit a specific failure mode.

7. What Didn’t Work (And Still Doesn’t)

Honest assessment.

Token costs are significant. A 12-agent workflow for a single feature uses substantially more tokens than a single-session implementation. The economics work for complex features that span multiple modules. They don’t make sense for small changes, bug fixes, or isolated tasks. Knowing when to use the pipeline vs. a single session is itself a skill.

The orchestrator is a bottleneck. The project manager runs sequentially by design. It can’t delegate Phase 2 work until Phase 1 completes. On long workflows, it sometimes loses track of state across extended sessions. Context limits are real. When the orchestrator gets confused, the whole pipeline stalls.

Parallel planning isn’t truly parallel. The sub-agent system has coordination overhead. Three agents spawned “in parallel” are faster than running sequentially, but not 3x faster. There’s setup cost, context loading, and result collection.

Error recovery is manual. When an agent fails (a test it can’t fix, a type error it doesn’t understand), the project manager doesn’t always recover gracefully. Human intervention remains the escape hatch. The system is semi-automated with a human on call.

Agents produce plausible-but-wrong output more often than you’d expect. The architect occasionally designs data models that look reasonable but miss an existing pattern in the codebase. The backend developer sometimes generates an API contract that doesn’t match what the frontend needs. These aren’t hallucinations in the obvious sense. They’re coherent, well-structured, and wrong. The quality gates catch most of it, but human review at the PR stage catches the rest. The system reduces human effort; it doesn’t eliminate human judgment.

Agent definitions are tightly coupled to the project. They reference specific file patterns, naming conventions, import paths, and quality check commands. This isn’t a portable framework. It’s a bespoke system for a specific codebase. Porting to another project means rebuilding most of the definitions. The principles transfer; the definitions don’t.

8. Where This Is Heading: From Manual Orchestration to Native Agent Teams

Everything described so far is manual orchestration. Agent definitions written by hand. Workflow encoded in markdown. Sequencing managed by a project manager agent following instructions. It works, but it’s held together by carefully crafted prompts, not by infrastructure.

Claude Code recently shipped an experimental feature called Agent Teams that provides native infrastructure for exactly this pattern. What Agent Teams adds beyond what I built manually:

Native split-pane display for monitoring all agents at once
File locking to prevent write conflicts during parallel work
Plan approval workflows where teammates plan in read-only mode until the lead approves
Automatic task dependency resolution, so completed tasks unblock downstream work without manual intervention

What this suggests: the pattern of specialized agents with structured workflows is not a hack or a workaround. It’s the direction the tooling is moving. Claude Code is building Agent Teams. CrewAI, LangGraph, and AutoGen are converging on similar primitives: role specialization, task dependencies, quality gates, scoped delegation. The specifics differ, but the shape is the same.

Teams that learn orchestration patterns now will be ready to adopt native tooling as it matures. The hard part was never the tooling. It’s understanding which workflows benefit from parallelism, where gates should go, how to scope agent responsibilities, and how to design the artifact interfaces between them. That understanding transfers regardless of whether you’re wiring agents together with markdown prompts or with a native team coordination layer.

9. The Orchestration Imperative

AI agent orchestration, not single-agent prompting, is the productivity lever engineering teams should invest in learning.

The gap between “AI writes code” and “AI delivers features” is a systems design problem. The models are capable enough. What’s missing is the organizational layer: who does what, in what order, with what inputs, through what gates, producing what outputs.

You don’t need 12 agents to start. Start with 3: a planner, a builder, and a reviewer. Add specialization when you hit a specific failure mode. When the planner is simultaneously making product decisions and architectural decisions, split it into two agents. When quality keeps slipping, add a dedicated checker. When the builder is deciding what to build while building it, add an architect. Grow the system; don’t design it from scratch.
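A three-agent starting point is small enough to sketch in full. In the Python below, the planner, builder, and reviewer are plain functions standing in for agents, and the artifact names (`spec.md`, `plan.md`, `build.md`, `review.md`) are invented for this example.

```python
from typing import Callable

Planner = Callable[[str], str]
Builder = Callable[[str], str]
Reviewer = Callable[[str, str], tuple[bool, str]]


def run_pipeline(spec: str, planner: Planner, builder: Builder, reviewer: Reviewer) -> dict[str, str]:
    """Plan, build, review. Each step reads named artifacts and writes one."""
    artifacts = {"spec.md": spec}
    artifacts["plan.md"] = planner(artifacts["spec.md"])
    artifacts["build.md"] = builder(artifacts["plan.md"])
    approved, notes = reviewer(artifacts["plan.md"], artifacts["build.md"])
    artifacts["review.md"] = notes
    artifacts["status"] = "ready-for-human-review" if approved else "blocked"
    return artifacts


# Stub agents for illustration.
def stub_planner(spec: str) -> str:
    return f"plan for: {spec}"


def stub_builder(plan: str) -> str:
    return f"implemented: {plan}"


def stub_reviewer(plan: str, build: str) -> tuple[bool, str]:
    ok = plan in build  # the build must reference the plan it was given
    return ok, "matches plan" if ok else "build ignores plan"


result = run_pipeline("add order export", stub_planner, stub_builder, stub_reviewer)
print(result["status"])  # ready-for-human-review
```

Even at this size the pipeline has the three properties that matter: each step has defined inputs, the output of each step is a named artifact, and the final status comes from a review gate, not from the builder.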

The principles transfer across any tooling. Constraints over capabilities. Gates over trust. Artifacts over conversation. Scoped changes with global verification. Model selection as a lever, not an afterthought.

The next time you watch an AI agent generate code, ask yourself: who reviews this? What happens next? Where does it go? If you don’t have answers, you have an orchestration problem.