There is a quiet pattern playing out across enterprise IT departments right now. A vendor demo lands well. A pilot is funded. Six months later, the pilot is "still in evaluation." Eighteen months later, it has been quietly rolled into next year's budget for "phase two."
The technology is not the problem. The problem is that production AI in a large organization is a fundamentally different artifact from a vendor demo, and most teams underinvest in the gap between the two.
Pilots optimize for the wrong thing
A pilot's job is to prove the model can produce useful outputs on representative inputs. A production system's job is to produce useful outputs on every input, every day, with predictable cost, observable behavior, and graceful failure modes. Those are not the same problem.
The pilot, in our experience, takes about 15% of the total work. The remaining 85% is the boring part: data pipelines, evaluation harnesses, change management, integration with the four legacy systems nobody wants to touch, and a fight with InfoSec over where the embeddings are stored.
Teams that succeed plan for the 85% from week one. Teams that fail are still surprised by it in month nine.
Eval before model
The single most reliable predictor of whether an enterprise AI project ships is whether the team built an evaluation harness before they picked a model. We have stopped doing engagements where the client wants to "just try GPT first and figure out evals later." Without an eval, you cannot tell if the model is getting better. Without that signal, every release becomes an argument.
A good eval is small, opinionated, and owned by the people who care about the outcome. Twenty examples, hand-graded by a domain expert, beat two thousand auto-generated examples graded by a weaker model. The eval is the contract.
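To make "the eval is the contract" concrete, here is a minimal sketch of what such a harness could look like. Everything in it is an assumption for illustration: the `EvalCase` shape, the `exact_match` grader, the two sample cases, and the `stub_model` stand-in for a real model call. In practice the expected answers would come from the domain expert, and the grader would often be the expert too.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # answer written by the domain expert (hypothetical data below)


def exact_match(output: str, expected: str) -> bool:
    # Deliberately simple grader; real graders are usually domain-specific.
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grade: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of cases the model passes."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(grade(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


# Hypothetical cases; a real set might hold about twenty hand-graded examples.
CASES = [
    EvalCase("Is invoice 17 overdue?", "yes"),
    EvalCase("Is invoice 18 overdue?", "no"),
]


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs without any API.
    return "Yes" if "17" in prompt else "maybe"


score = run_eval(stub_model, CASES)
```

Because `run_eval` takes the model as a plain callable, the same cases can be run against any candidate model, which is what lets the team compare releases with a number instead of an argument.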
Build versus buy is the wrong frame
The interesting question is not "build or buy." It is "where in the stack do we want to differentiate?" Most teams should buy infrastructure (inference, vector DBs, observability) and build the parts that touch their domain (the eval, the data, the workflow). Inverting that — building infrastructure, buying domain logic — is how organizations end up with both a six-figure inference bill and a chatbot their customers find useless.
Agentic systems are not magic
The agent narrative has gotten ahead of the agent reality. Production agents that run unattended over real data are a research problem, not a checkbox feature. The teams getting value from agentic patterns are the ones using them as a careful augmentation of a human workflow — escalating to a person at every uncertain step, with full audit trails — not as autonomous systems.
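The "escalate at every uncertain step, with full audit trails" pattern can be sketched in a few lines. This is an illustration under stated assumptions, not a reference design: it assumes each agent step carries a usable confidence score (which real systems often have to approximate), and the `Step` type, the `route` function, and the 0.9 threshold are all hypothetical.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class Step:
    action: str
    confidence: float  # assumed to be supplied by the agent; hypothetical


CONFIDENCE_FLOOR = 0.9  # illustrative threshold, not a recommendation


def route(step: Step, audit_log: list[str], threshold: float = CONFIDENCE_FLOOR) -> str:
    """Decide whether a step runs automatically or goes to a person."""
    decision = "auto" if step.confidence >= threshold else "escalate"
    # Every decision is recorded, including the ones that ran automatically.
    record = {"ts": time.time(), **asdict(step), "decision": decision}
    audit_log.append(json.dumps(record))
    return decision
```

A call such as `route(Step("close_account", 0.4), log)` returns `"escalate"` and appends one JSON line to `log`, so the trail covers both the automated and the human-reviewed paths.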
If your agent demo includes the phrase "and then it just figures out what to do," budget for at least six months of additional engineering before that becomes true in production.
Governance is a release schedule
The mistake is to treat governance as a separate workstream. The teams that ship treat it as part of the release process: every model change goes through the eval, every prompt change is versioned, every deployment is observable. Governance becomes a tooling problem, not a committee.
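As a rough sketch of governance as tooling, the release gate below ties a model identifier, a versioned prompt, and an eval score together, and blocks any change that regresses against the current baseline. The `Release` type, the `can_ship` function, and the hash-based prompt version are assumptions made for illustration; a real pipeline would likely store these records alongside deployment metadata.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    model_id: str
    prompt: str
    eval_score: float  # fraction of eval cases passed, between 0.0 and 1.0

    @property
    def prompt_version(self) -> str:
        # Content hash: any edit to the prompt text yields a new version.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


def can_ship(candidate: Release, baseline: Release, tolerance: float = 0.0) -> bool:
    """Allow a release only if it does not regress beyond the tolerance."""
    return candidate.eval_score >= baseline.eval_score - tolerance
```

Hashing the prompt text means a prompt change cannot ship silently: the version string changes, and the gate forces the new prompt through the same eval as a model change.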
A useful test: if your security team asked you to explain how your AI system made a single specific decision, could you do it in under thirty minutes? If the answer is no, you have a governance gap, regardless of how many policies are written down.
The enterprises shipping AI well in 2026 look a lot like the ones that shipped microservices well in 2018: small teams, fast iteration, deep ownership, boring infrastructure underneath.