Seven failure modes of AI integration projects.
A field-tested taxonomy of how well-intentioned AI programs go sideways — and a frame for telling project failures apart from organizational ones.
Most AI integration projects that fail do not fail because the model was wrong. They fail for reasons that are visible, in retrospect, months before any model is trained — and which were, in many cases, visible before the project was even approved. What follows is a working taxonomy of the seven most common ones. We have seen each of them more than once.
The modes are ordered roughly from the most conceptual (what leadership believes about AI) to the most operational (how the project was scoped and staffed). You will likely recognize three or four of them in any organization you know well. That is the point: these are not exotic failures. They are the default failures, and avoiding them requires deliberate effort.
Mode 01 AI does not mean Absurdly Intuitive.
The single most common expectation gap — and therefore the single most common source of disappointment — is the belief that a large model, given enough organizational information, can produce rational and accurate insights across a broad range of questions posed to it.
This belief is pervasive at the executive level because it is how the technology is marketed and because casual use of consumer AI reinforces it. The model produces a coherent answer to almost anything. The answers feel authoritative. The leap from "coherent and authoritative" to "rational and accurate in my business context" is one the marketing materials rarely pause to examine.
The practical consequence is that projects get scoped against the intuitive version of AI rather than the actual one. Ambiguous questions are expected to yield unambiguous answers. Heterogeneous data is expected to be self-reconciling. Context that would take a human analyst a week to assemble is expected to be assembled by the model on demand. None of this is impossible, but all of it requires a great deal of upstream work that the "just feed it everything" framing actively discourages.
When the project underdelivers, the diagnosis is often that "the model isn't good enough yet." The actual diagnosis is usually that the question was never answerable in the form it was asked.
Mode 02 Machine speed reveals structural problems that were always there.
There is a reason nobody drives a Volkswagen Beetle at 150 miles per hour, even with a Porsche turbo engine in the back. At 50 mph, everything else about the car works fine: the brakes, the steering, the suspension, the tire grip, the aerodynamics. At 150 mph, you are asking for trouble — not because the engine is wrong, but because nothing else was built to operate at that speed.
Business processes are the same. The existing reconciliation cadence, the approval workflow, the exception-handling convention, the downstream reporting lag — all of it was calibrated, often unconsciously, to the speed at which humans produce work. When AI agents begin operating at machine speed in one part of the process, everything around them starts shaking in ways the organization did not anticipate and is not equipped to diagnose.
Queues that used to drain overnight now fill faster than downstream consumers can process. Daily reconciliations become inadequate and have to move intraday. Error batches that used to be handled by one analyst in the morning arrive in volumes that overwhelm the handling process. The AI is working exactly as designed. The rest of the car is coming apart.
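The dynamics are simple enough to sketch. In the minimal Python model below, every number is illustrative, not drawn from any engagement; the only point is that once any stage produces faster than its downstream consumer can absorb, the backlog grows linearly and without bound, no matter how well the AI stage itself performs.

```python
# Minimal backlog model. All figures are illustrative, not client data.

def backlog_over_days(arrivals_per_day: int, capacity_per_day: int, days: int) -> list[int]:
    """End-of-day backlog for a fixed-capacity downstream consumer."""
    backlog, history = 0, []
    for _ in range(days):
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
        history.append(backlog)
    return history

# Before: humans produce 400 items/day, downstream clears 500/day.
print(backlog_over_days(400, 500, days=5))   # [0, 0, 0, 0, 0] - queue drains
# After: agents produce 2,000 items/day against the same 500/day capacity.
print(backlog_over_days(2000, 500, days=5))  # [1500, 3000, 4500, 6000, 7500]
```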
The lesson: do not introduce machine speed in one place without explicitly modeling what happens to everything connected to it. Most integration plans skip this step, because it requires admitting the diagnosis we gave in our first field note — that the map isn't the territory, and the connected systems are not as clean as the diagram suggested.
Mode 03 Misusing, misunderstanding, and misappropriating metrics.
Every AI integration project has metrics. Most of the metrics are wrong. Not maliciously wrong — just calibrated to measure something other than the thing the organization actually cares about.
The classic case is ticket closure rate. A team introduces AI-assisted triage and reports that closure rate climbs 40 percent. What has actually happened is that one complex ticket that used to require substantive resolution is now broken into ten smaller tickets, each delivering a fragment of a solution. Closure rate goes up. Customer satisfaction goes down. Nobody on the project dashboard sees the second number, because the second number was not chosen as a metric.
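The arithmetic of the distortion is worth seeing once. The figures below are invented for illustration (they are not the 40 percent case above); the point is only that the dashboard number and the outcome number can move in opposite directions.

```python
# Invented figures: one real customer problem splits into ten sub-tickets,
# nine of which close quickly while one carries the hard part.

problems = 100                      # real customer problems in the period
before_closed = 80                  # substantive resolutions, old process
before_rate = before_closed / problems              # 0.80

subtickets = problems * 10          # AI triage splits each problem into 10
after_closed = int(subtickets * 0.9)                # 900 sub-tickets closed
after_rate = after_closed / subtickets              # 0.90
resolved_problems = 55              # problems whose hard sub-ticket closed

print(f"closure rate: {before_rate:.0%} -> {after_rate:.0%}")        # 80% -> 90%
print(f"problems resolved: {before_closed} -> {resolved_problems}")  # 80 -> 55
```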
The pattern generalizes. Whatever metric you choose, your AI system will optimize for it with extraordinary efficiency — and unlike a human team, it will do so without any countervailing sense that something is off. Humans have an internal alarm when their work starts to feel like theater. Models do not.
The defense against this is not better metrics. Better metrics get gamed too. The defense is a discipline of measuring the thing that actually matters — often a lagging, qualitative outcome — and accepting that the leading indicators will always be partial proxies that have to be watched with suspicion. This discipline is uncomfortable because it makes project reporting harder. It is also the only reliable way to know whether the integration is working.
Mode 04 Confusing soft low-risk processes with hard high-risk ones.
AI performs spectacularly well in a specific class of work: suggesting options, adapting to user behavior, generating candidate responses, capturing reactions into reinforcing feedback loops. These activities are generative in nature. They tolerate — even benefit from — variance in output. If the recommendation is slightly off, the user adjusts, and the system learns.
This is not end-to-end transaction management. Regulatory filings do not tolerate variance. Trade settlements do not tolerate variance. Benefit determinations, legal disclosures, medication dosing, and payment releases do not tolerate variance. These are hard processes — they have right and wrong answers, they have audit trails, they have regulators, and the cost of being wrong is categorically different from the cost of suggesting a suboptimal playlist.
The failure mode is not that organizations deploy AI in hard processes. It is that they deploy AI in hard processes while still imagining the soft-process risk profile: tolerant of variance, quick to iterate, forgiving of error. The two profiles are not interchangeable, and the governance required for each is different in kind, not degree.
Mode 05 Introducing AI at the core when it belongs at the edge.
A closely related failure, and one of the most consequential. AI is most useful, and most safely deployed, where the data is well-structured and the activity is somewhat peripheral to the core transaction fabric of the business. At the edge, the blast radius of a mistake is small. The tree, if shaken, does not drop much fruit.
Introducing AI, particularly autonomous AI operating at machine speed, at the core of the operation is a different project entirely. Core systems are where the business's money, identity, and obligations live. The data is often messier than at the edge precisely because it has been accreting for decades. The downstream dependencies are more numerous, the audit requirements are more stringent, and the organizational politics are an order of magnitude more complex.
Organizations that succeed at core AI integration almost always got there by succeeding at the edge first and earning the license to move inward. Organizations that start at the core, attracted by the scale of the potential prize, typically spend eighteen months discovering the actual meaning of everything we have been discussing — and in the worst cases, they discover it the way the cloud-migration projects of a decade ago discovered unsecured FTP servers: expensively, publicly, and in a form that creates a new risk committee.
Mode 06 Claiming benefits that haven't been defined.
This one may not be a failure in the strictest sense — the project may still deliver something. But it is one of the most common reasons that apparently successful AI projects produce unresolvable disagreements about whether they worked. The root cause is almost always that nobody pinned down, in writing, what "benefit" meant.
Three categories of benefit get conflated constantly. They mean different things, they are measured differently, and they have different implications for how the organization must change:
- Hard save: Less money is spent on people or equipment than was spent before. The headcount falls. The contract is cancelled. The license count drops. A real dollar figure comes out of a real line in the budget. Full stop.
- Hard avoid: No additional money had to be spent to perform a new function or provide a new capability. Nothing was saved (existing spend is unchanged), but a cost that would have been incurred is not.
- Efficiency gain: The same work is done faster, or with fewer errors, or more work is done for the same money, or the same work is done with fewer people in area X₁ but more people in area X₂. Murky by nature. Usually requires a definition before the project starts, not after.
The failure mode is not that organizations pursue any one of these. It is that the project sponsor, the finance partner, and the operational owner each have a different category in mind, and this fact is not discovered until the project is being evaluated in its final quarter. The CFO is looking for a hard save. Operations is celebrating an efficiency gain. The project team is measuring an avoid. Everyone is right. Nobody is happy.
The remediation is unglamorous: insist that the benefit category be named at project approval, in a single short paragraph, in language that a skeptical finance reviewer would sign. Projects that cannot produce this paragraph should probably not be approved yet.
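One way to make that insistence mechanical is to put the category into the approval artifact itself, so a project cannot be recorded without naming it. A minimal sketch; the field names and validation are ours, not any standard:

```python
# Sketch of a benefit declaration required at approval time.
# Field names are illustrative; the point is that the category is
# mandatory, singular, and written down before the work begins.
from dataclasses import dataclass
from enum import Enum

class BenefitCategory(Enum):
    HARD_SAVE = "hard save"              # real dollars leave a real budget line
    HARD_AVOID = "hard avoid"            # a cost that would have been incurred is not
    EFFICIENCY_GAIN = "efficiency gain"  # same work, faster or cheaper or cleaner

@dataclass(frozen=True)
class BenefitDeclaration:
    category: BenefitCategory
    paragraph: str   # the short paragraph a skeptical finance reviewer would sign
    owner: str       # who is accountable for the number
    baseline: str    # what the number is measured against, fixed up front

    def __post_init__(self):
        if not self.paragraph.strip():
            raise ValueError("No benefit paragraph, no approval.")
```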
Mode 07 Treating the pilot as proof.
The seventh failure is the one that surprises people, because by every normal standard the project was succeeding right up until it collapsed. A pilot was run. The pilot produced clean results. The numbers justified a production rollout. The rollout began, and within six months the benefits had evaporated.
The reason is almost always the same. Pilots succeed because they enjoy three conditions that production never enjoys: a curated slice of data, a motivated team, and an unusually clean portion of the workflow. The curation is often unconscious: the pilot naturally runs on the easy customers, the typical cases, the happy-path transactions, because those are the ones available in the time box. The team is staffed with people who volunteered or were hand-picked. The workflow slice is typically the one with the cleanest documentation, which is to say, the one whose real complexity has been most thoroughly papered over.
Production has none of these luxuries. The data includes every exception, every legacy case, every accumulation of corrections that were never properly integrated. The team is whoever happens to be staffed to the function, not whoever was motivated enough to join a pilot. The workflow is the entire workflow, including the parts the pilot deliberately avoided.
The practical defense is to design the pilot to disprove rather than to prove. Include the messy cases. Include the people who weren't volunteers. Include the parts of the workflow the team would have rather set aside. A pilot that succeeds under those conditions is evidence. A pilot that succeeds on the easy slice is a sales tool.
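One concrete way to build a disproving pilot is to sample its caseload in proportion to production, exceptions and legacy cases included, rather than taking whatever is convenient. A sketch, with invented strata and proportions:

```python
# Stratified pilot sample that mirrors the production case mix.
# Strata names and proportions are invented for illustration.
import random

def pilot_sample(cases_by_stratum: dict[str, list], sample_size: int,
                 rng: random.Random) -> list:
    """Proportional stratified sample: exceptions and legacy cases appear
    in the pilot at roughly their production rates, never at zero."""
    total = sum(len(cases) for cases in cases_by_stratum.values())
    sample = []
    for cases in cases_by_stratum.values():
        k = max(1, round(sample_size * len(cases) / total))
        sample.extend(rng.sample(cases, min(k, len(cases))))
    return sample

rng = random.Random(7)
population = {
    "happy_path": [f"hp-{i}" for i in range(800)],
    "exceptions": [f"ex-{i}" for i in range(150)],
    "legacy_corrections": [f"lg-{i}" for i in range(50)],
}
pilot = pilot_sample(population, sample_size=100, rng=rng)
# Roughly 80 happy-path, 15 exception, 5 legacy cases: the pilot can now
# fail for the same reasons production would.
```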
A closing frame Is this a project failure, or an organizational one?
All seven modes look identical from the outside. The deliverables miss. The timeline slips. The benefits don't materialize. The post-mortem is scheduled. But the seven modes are not all the same kind of failure, and the distinction matters because the response required is different.
Some failures are project failures — things went wrong inside the scope and authority of the project itself. Others are organizational failures — the project was never going to succeed given the conditions around it, and no amount of project discipline would have saved it. Mistaking the second kind for the first is why so many organizations run the same failed project twice with a different vendor.
Project failure
- Market, HR, or regulatory conditions shifted and the motivating case for the project evaporated (e.g., the target market was exited, an acquisition re-prioritized, a regulation changed).
- The technology itself failed — it did not scale, a security gap was discovered late, a vendor ceased operations, a new approach arrived that obsoleted the chosen one.
- Execution was substandard: staffing, sequencing, testing, or delivery discipline was not adequate to the scope.
Organizational failure
- Economic, market, staffing, or flexibility goals were underspecified — or, more often, deliberately kept unclear because clarity would have surfaced disagreements the sponsor did not want to have.
- The project's technical scope was sound, but its integration points with the rest of the organizational footprint were underspecified or ignored, so the delivered capability never connected to anything that could realize its benefits.
- Senior authority was either insufficient or unwilling to enact the hard consequences: the headcount reduction, the budget shift, the reorganization, the closure of a duplicative function.
Notice the asymmetry. Project failures can be addressed by better project management, better vendors, better delivery. Organizational failures cannot — they can only be addressed by someone senior enough to change the conditions around the project. A good consultant's first job is to figure out which kind of failure is being set up, and to say so, clearly, before the budget is approved.
The seven failure modes above mostly lean one way or the other. Modes 01–03 are typically project failures, resolvable within the project's own authority if they are noticed in time. Modes 04–07 are almost always organizational — they require authority that sits above the project to resolve, and pretending otherwise is the single largest source of wasted AI spending we have observed.
If you are reading this and recognize three or more of the modes in a program you are currently responsible for, we would suggest that the question is no longer whether the program will deliver as planned. The question is whether the authority required to redirect it is in the room.
If one of these modes looks familiar, we'd welcome the conversation.
inquiries@moschetticonsulting.com