The case for AI pilots that actually ship

This is the boring half of AI implementation — the half nobody writes about. The framing on most stages, in most articles, in most boardrooms is still about what AI can do. The actual problem on the floor is a much smaller, much less photogenic question: which of the things AI can do will still be in production six months from now? Because the rest is a line on the budget that closes without anybody being quite sure what it produced.

I want to spend this piece on the difference between the pilots that ship and the pilots that don't. Not because I want to be skeptical about AI — I run an AI practice and I think the tools are extraordinary. I want to be skeptical about pilots. The two are not the same thing.

Where pilots actually go to die

The death certificate on most AI pilots is signed in one of four places.

Place one: integration debt. The pilot ran in a sandbox. The classifier worked on a CSV of historical inquiries. Beautiful F1 score on the test set. Production integration was scoped at "two weeks." Three months later, the pilot is still waiting on permissions to read the CRM, the team has lost momentum, and the original sponsor is justifying the slip in budget review meetings.

Place two: data the model never saw. The model was trained on 800 polished examples. Production traffic includes informal Croatian-English code-switching, attachments the OCR can't parse, and edge cases that account for 12% of volume. The pilot's accuracy drops twenty percentage points. The team's confidence drops further. Quiet retreat.

Place three: nobody owns the workflow. The AI generates a suggestion. The agent is supposed to review and send. But "supposed to" is doing a lot of work — there's no clear escalation rule, no measurement of override rate, and no one is responsible for the days when the model misbehaves. Six weeks in, agents have learned to ignore the suggestion. The tool is technically running. It isn't being used.
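Measuring the override rate doesn't take much. Here is a minimal sketch, assuming the tool can log one record per suggestion with a date and what the agent did with it. The `SuggestionEvent` record and its three outcome labels are hypothetical, not taken from any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SuggestionEvent:
    """One AI suggestion shown to an agent (hypothetical log record)."""
    day: date
    outcome: str  # assumed labels: "sent_as_is", "edited", or "ignored"


def weekly_override_rate(events):
    """Return {(iso_year, iso_week): share of suggestions edited or ignored}."""
    shown = defaultdict(int)
    overridden = defaultdict(int)
    for event in events:
        iso = event.day.isocalendar()
        week = (iso[0], iso[1])
        shown[week] += 1
        # Anything other than sending the suggestion untouched counts as an override.
        if event.outcome != "sent_as_is":
            overridden[week] += 1
    return {week: overridden[week] / shown[week] for week in sorted(shown)}


# Illustrative data only: two suggestions in each of two consecutive weeks.
events = [
    SuggestionEvent(date(2024, 3, 4), "sent_as_is"),
    SuggestionEvent(date(2024, 3, 5), "edited"),
    SuggestionEvent(date(2024, 3, 11), "ignored"),
    SuggestionEvent(date(2024, 3, 12), "ignored"),
]
rates = weekly_override_rate(events)
print(rates)
```

A rate that climbs week over week is the early signal that agents are learning to ignore the suggestion, well before anyone says so in a meeting.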

Place four: the wrong success metric. The pilot was measured on technical accuracy when leadership cared about response time. Or on response time when finance cared about cost per inquiry. The metric the pilot reports doesn't connect to the budget conversation in the room where the project's fate is decided. So it doesn't survive the budget conversation.

The four-question filter we run before any pilot

We don't start AI pilots that fail any of these questions. Not "we'd prefer not to": we don't. The fastest way to get to a pilot that ships is to refuse pilots that won't.

1. Is the workflow stable enough to instrument?

If the underlying process is changing every other month — new fields in the form, a new escalation rule, a re-org of the team — the pilot will be measuring noise. We pause and stabilize the workflow first. Six weeks of doing nothing AI-related, in service of a pilot that actually has a baseline to beat.

2. Is the integration path real?

Before the pilot starts, we run a one-week integration spike: read access, write access, queueing, retries, monitoring. Production-quality, not sandbox-quality. If the integration path goes through three fiscal quarters of procurement negotiation with a vendor that has to approve the API call, we say so, and we hold the pilot until that's resolved.
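One way to keep the spike honest is to write the checklist as code that either passes or names what is blocked. This is a rough sketch under stated assumptions: `with_retries`, `run_spike`, and the two stand-in checks are hypothetical names for illustration, and the real checks would call the actual CRM or queue rather than raise canned errors.

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0):
    """Call operation(); on an exception, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))


def run_spike(checks, attempts=3, base_delay=1.0):
    """Run each named check and return {name: "ok" or "failed: <reason>"}."""
    report = {}
    for name, check in checks.items():
        try:
            with_retries(check, attempts=attempts, base_delay=base_delay)
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"failed: {exc}"
    return report


# Stand-in checks for illustration only.
calls = {"read": 0}


def read_crm_record():
    """Simulates a read that times out once, then succeeds."""
    calls["read"] += 1
    if calls["read"] < 2:
        raise TimeoutError("CRM timed out")
    return {"id": 1}


def write_crm_note():
    """Simulates a write that is blocked on permissions."""
    raise PermissionError("write scope not granted")


report = run_spike(
    {"read access": read_crm_record, "write access": write_crm_note},
    base_delay=0.0,  # no waiting in the demo; a real spike would back off
)
print(report)
```

The transient timeout is absorbed by the retry; the missing permission is not, and it shows up in the report in week one instead of month three.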

3. Does someone own the override loop?

Every AI workflow in production needs a named person whose job, on a defined cadence, is to look at the cases the model got wrong, ask why, and decide whether the answer is more training data, a workflow change, or new escalation rules. If we can't name that person, we don't start. The day-one accuracy of the model is the least interesting number; the rate of improvement over month one is what tells you whether the project will ship.
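The rate of improvement can be read straight off the owner's weekly review. A minimal sketch, assuming the review produces pairs of (week number, whether the model was right); the function names and sample data are illustrative, and the average change between the first and last week is just one simple way to summarize the trend.

```python
def weekly_accuracy(reviewed):
    """reviewed: iterable of (week_number, was_correct). Returns {week: accuracy}."""
    totals, correct = {}, {}
    for week, ok in reviewed:
        totals[week] = totals.get(week, 0) + 1
        correct[week] = correct.get(week, 0) + (1 if ok else 0)
    return {week: correct[week] / totals[week] for week in sorted(totals)}


def improvement_per_week(accuracy_by_week):
    """Average change in accuracy per week between the first and last week."""
    weeks = sorted(accuracy_by_week)
    if len(weeks) < 2:
        return 0.0  # not enough history to talk about a trend
    first, last = weeks[0], weeks[-1]
    return (accuracy_by_week[last] - accuracy_by_week[first]) / (last - first)


# Illustrative review log: four reviewed cases per week.
reviewed = [
    (1, True), (1, False), (1, True), (1, False),
    (2, True), (2, True), (2, False), (2, False),
    (3, True), (3, True), (3, True), (3, False),
]
by_week = weekly_accuracy(reviewed)
trend = improvement_per_week(by_week)
print(by_week, trend)
```

A flat or negative trend after a month of reviews usually means the loop exists on paper but nobody is acting on what it finds.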

4. Does the success metric belong in the next budget review?

If the metric the pilot is going to report on isn't the metric leadership will discuss in the next quarterly review, we change the metric — not the leadership. The metric must matter to the people who decide whether the pilot becomes a product line. Anything else is a research project. Research projects are useful, but they aren't pilots.
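The four questions work as a hard gate, not a score, and that is easy to show in code. A small sketch, assuming each question has already been answered honestly by a human; the `PilotProposal` fields are hypothetical names for those answers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PilotProposal:
    """Answers to the four questions (field names are illustrative)."""
    workflow_stable: bool
    integration_path_real: bool
    override_owner: Optional[str]  # a named person, or None
    metric_in_budget_review: bool


def failed_questions(proposal):
    """Return the list of questions the proposal fails, in order."""
    failures = []
    if not proposal.workflow_stable:
        failures.append("Is the workflow stable enough to instrument?")
    if not proposal.integration_path_real:
        failures.append("Is the integration path real?")
    if not proposal.override_owner:
        failures.append("Does someone own the override loop?")
    if not proposal.metric_in_budget_review:
        failures.append("Does the success metric belong in the next budget review?")
    return failures


def should_start(proposal):
    """A pilot starts only if it fails none of the four questions."""
    return not failed_questions(proposal)


proposal = PilotProposal(
    workflow_stable=True,
    integration_path_real=True,
    override_owner=None,  # nobody named yet
    metric_in_budget_review=True,
)
print(should_start(proposal), failed_questions(proposal))
```

There is deliberately no weighting: three strong answers do not compensate for one missing owner.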

What a shipped pilot looks like in month four

The shipped pilots in our portfolio have something in common, and it isn't the technology — they span GPT-4-class models, smaller fine-tuned classifiers, and one piece of plain rules engineering that everybody on the team has the grace to still call "the AI project." What they share is texture in month four.

In month four of a shipped pilot, three things are true:

Someone on the operational team — not the AI team — runs the weekly status. The pilot has been absorbed into operations.
The success metric is in the company's regular reporting, not in a parallel deck the AI sponsor built.
There's a backlog of next-best-uses. Once a pilot ships, demand for the next workflow opens up. If month four still has no backlog, the pilot probably hasn't really shipped — it's still a curiosity.

The honest line

Most companies I talk to don't need more AI experiments. They need fewer, better-chosen ones. The handful of workflows that absolutely justify AI — and they exist in almost every company, regardless of size — are usually obvious to anyone who's been on the floor. The trouble isn't finding them. The trouble is the discipline of saying no to the rest.

If we agree on that, the rest is mostly logistics. Which is the boring news, and also the only honest news.
