Why 95% of GenAI pilots fail and what you can do differently

Seven causes from the replication-crisis diagnosis and seven concrete countermeasures for CFOs, COOs and CDOs.

Guido Winger

In August 2025 the MIT NANDA programme published the report State of AI in Business 2025. Despite 30 to 40 billion US dollars in enterprise spending on generative AI, 95% of organisations see no measurable P&L effect. Only 5% of pilots reach the promised revenue acceleration (Fortune, 18 Aug 2025).

A year earlier, on 29 July 2024, Gartner had predicted that at least 30% of all GenAI projects would be abandoned after the proof of concept by the end of 2025, due to poor data quality, inadequate risk controls, escalating costs or unclear business value (Gartner press release).

Both numbers trace back to a common cause: the GenAI wave repeats the methodological mistakes of the replication crisis in academic research - only this time with quarterly budgets and boardroom promises instead of peer review.

1 · The seven methodological causes

From the analysis of 300 publicly documented AI deployments in the MIT NANDA report and from our own hands-on observation in mid-cap implementations, we identify seven recurring causes of PoC failure.

1.1 Claim inflation in the preparation phase

What the PoC proposal says: "We raise productivity by 40% through GenAI coding assistants." What is measurable after six months: a hard-to-interpret mix of tool adoption, task shifting and Hawthorne effects. Nobody can tell whether the productivity numbers come from using the tool or from the increased attention on the metrics.

What you do differently: before the PoC, define a written, pre-specified success endpoint. Not "we improve X by Y%", but "we measure Z on population W between date A and B, against comparison population V".
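The endpoint template above (Z, W, V, A, B) can be written down as a small record that is frozen at sign-off. This is only a minimal sketch: the class name `SuccessEndpoint`, its fields and all example values are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after sign-off
class SuccessEndpoint:
    metric: str                 # Z: what is measured
    population: str             # W: who is measured
    comparison_population: str  # V: who they are compared against
    start: date                 # A: first day of the measurement window
    end: date                   # B: last day of the measurement window
    signed_off: date            # day the endpoint was put in writing

    def is_pre_specified(self) -> bool:
        """True only if sign-off happened before the measurement window opened."""
        return self.signed_off < self.start < self.end


# Hypothetical example values, for illustration only.
endpoint = SuccessEndpoint(
    metric="median cycle time per merged pull request (hours)",
    population="backend team with coding assistant",
    comparison_population="backend team without coding assistant",
    start=date(2026, 9, 1),
    end=date(2026, 11, 30),
    signed_off=date(2026, 8, 15),
)
print(endpoint.is_pre_specified())
```

An endpoint that is signed off after the window has opened fails the check, which is the "we will define that during the pilot" case.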

1.2 Data-quality assumptions without an audit

What the PoC assumes: "Our CRM data is clean enough for an LLM-RAG system." What is actually the case: 30-40% of the CRM records are outdated, redundant or faulty. The GenAI system reproduces the data-quality problems as "hallucinations", and the PoC fails because the answers are wrong.

What you do differently: before the PoC, commission a data audit that is not carried out by the same team that proposes the PoC. Require a written data-quality statement that includes the measured error rate.
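The error rate in such a data-quality statement can come from a very simple script. The sketch below assumes records are plain dictionaries with `email`, `name` and `last_updated` fields. These field names, the three checks and the sample data are illustrative assumptions, not a description of any particular CRM.

```python
from datetime import date


def audit_records(records, required_fields, stale_before):
    """Count records that are faulty, redundant or outdated (one flag per record)."""
    seen_keys = set()
    by_type = {"faulty": 0, "redundant": 0, "outdated": 0}
    flagged = 0
    for rec in records:
        problems = set()
        # Faulty: a required field is missing or empty.
        if any(not rec.get(field) for field in required_fields):
            problems.add("faulty")
        # Redundant: same normalised e-mail as an earlier record.
        key = (rec.get("email") or "").strip().lower()
        if key:
            if key in seen_keys:
                problems.add("redundant")
            seen_keys.add(key)
        # Outdated: last update lies before the agreed cut-off date.
        updated = rec.get("last_updated")
        if updated is not None and updated < stale_before:
            problems.add("outdated")
        for problem in problems:
            by_type[problem] += 1
        if problems:
            flagged += 1
    error_rate = flagged / len(records) if records else 0.0
    return {"records": len(records), "flagged": flagged,
            "error_rate": error_rate, "by_type": by_type}


# Hypothetical sample data, for illustration only.
sample = [
    {"email": "a@example.com", "name": "A", "last_updated": date(2026, 1, 10)},
    {"email": "A@example.com", "name": "A (copy)", "last_updated": date(2026, 2, 1)},
    {"email": "b@example.com", "name": "", "last_updated": date(2026, 3, 1)},
    {"email": "c@example.com", "name": "C", "last_updated": date(2021, 5, 1)},
]
report = audit_records(sample, required_fields=("email", "name"),
                       stale_before=date(2024, 1, 1))
print(report)
```

A real audit would need domain-specific rules, but even a crude count like this replaces "clean enough" with a number that can be signed.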

1.3 Comparison against a weak baseline

What the PoC compares: "GenAI assistant vs. no support". What should be compared in real operation: "GenAI assistant vs. existing tools". Whoever tests GenAI against "no tool" has an artificially low baseline and an overstated benefit.

What you do differently: the baseline is always the current best-practice state, not the zero state. If you work today with a classical ML model or a rule-based heuristic, that is the baseline.
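How much the choice of baseline changes the headline number is easy to show with arithmetic. The figures below (tickets resolved per agent per day) are invented for illustration, and `relative_uplift` is a hypothetical helper.

```python
def relative_uplift(candidate: float, baseline: float) -> float:
    """Relative improvement of the candidate over the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


# Hypothetical throughput figures, for illustration only.
no_tool = 20.0             # zero state: no support at all
existing_heuristic = 26.0  # current best practice, e.g. a rule-based heuristic
genai_assistant = 28.0     # the PoC candidate

uplift_vs_zero = relative_uplift(genai_assistant, no_tool)
uplift_vs_current = relative_uplift(genai_assistant, existing_heuristic)

print(f"vs. no tool:       {uplift_vs_zero:.1%}")
print(f"vs. current state: {uplift_vs_current:.1%}")
```

With these made-up numbers the same assistant shows a 40% gain against the zero state and under 8% against the existing heuristic.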

1.4 Success metric not translated into P&L

What the PoC team presentation shows: "BLEU score 0.82", "faithfulness 0.91", "user satisfaction 4.2 out of 5". What the CFO needs: euros per quarter, cash-flow impact, OPEX reduction. The gap between the ML metric and the P&L metric is rarely closed.

What you do differently: before the PoC, write a translation table that states which ML metric corresponds to which euro impact. Without this table, no PoC success is defensible in a board meeting.
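One row of such a translation table might look like the sketch below. It assumes a linear relationship between the ML metric and a business driver, which is a strong simplification. The class `TranslationRow` and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationRow:
    ml_metric: str
    business_driver: str
    # Assumed linear link: change in the driver per quarter for a
    # metric change of 1.0 (agreed in writing before the PoC).
    driver_units_per_metric_unit: float
    eur_per_driver_unit: float

    def eur_per_quarter(self, metric_delta: float) -> float:
        """Euro impact per quarter implied by a change in the ML metric."""
        driver_delta = metric_delta * self.driver_units_per_metric_unit
        return driver_delta * self.eur_per_driver_unit


# Hypothetical row, for illustration only.
row = TranslationRow(
    ml_metric="faithfulness",
    business_driver="escalations to second-level support avoided",
    driver_units_per_metric_unit=4000.0,
    eur_per_driver_unit=12.0,
)
impact = row.eur_per_quarter(0.05)  # faithfulness improves by 0.05
print(f"{row.ml_metric}: {impact:,.0f} EUR per quarter")
```

Under these assumptions a 0.05 gain in faithfulness would correspond to roughly 2,400 euros per quarter. The point is not the number but that the conversion factors are fixed before the results are known.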

1.5 Total-cost-of-ownership blindness

What the PoC calculates: the licence cost of the LLM provider. What surfaces after six months in production: token costs scale linearly with usage; the compliance overhead of GDPR mapping doubles the workload of the compliance team; latency SLAs require multi-region deployments; the RAG setup demands a vector-database licence.

What you do differently: TCO modelling over twelve months with three usage scenarios (low, medium, high), including compliance, infrastructure and personnel overhead.
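A twelve-month TCO model with three usage scenarios fits in a few lines. The sketch below treats token cost as linear in usage and everything else as a fixed monthly amount. All names and cost figures are hypothetical placeholders that you would replace with your own quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    requests_per_month: int


@dataclass(frozen=True)
class CostAssumptions:
    licence_eur_per_month: float
    tokens_per_request: int
    eur_per_1k_tokens: float
    infrastructure_eur_per_month: float
    compliance_eur_per_month: float
    personnel_eur_per_month: float


def tco_12_months(s: Scenario, c: CostAssumptions, months: int = 12) -> float:
    """Total cost over the horizon: usage-driven token cost plus fixed overhead."""
    token_cost = s.requests_per_month * c.tokens_per_request / 1000 * c.eur_per_1k_tokens
    fixed = (c.licence_eur_per_month + c.infrastructure_eur_per_month
             + c.compliance_eur_per_month + c.personnel_eur_per_month)
    return months * (token_cost + fixed)


# Hypothetical figures, for illustration only.
costs = CostAssumptions(
    licence_eur_per_month=2000.0,
    tokens_per_request=2000,
    eur_per_1k_tokens=0.01,
    infrastructure_eur_per_month=1500.0,
    compliance_eur_per_month=3000.0,
    personnel_eur_per_month=8000.0,
)
scenarios = [
    Scenario("low", 10_000),
    Scenario("medium", 100_000),
    Scenario("high", 1_000_000),
]
results = {s.name: tco_12_months(s, costs) for s in scenarios}
for name, total in results.items():
    print(f"{name:>6}: {total:,.0f} EUR over 12 months")
```

With these placeholder figures the high scenario costs more than twice the low one, and the licence fee alone explains only a small part of either total.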

1.6 No failure-mode analysis

What does not happen in the PoC: a systematic analysis of the situations in which the GenAI system answers incorrectly, the economic damage a pattern of wrong answers causes, and how the failure mode is detected in production.

What you do differently: a pre-mortem instead of a post-mortem. Before go-live, identify at least three realistic failure modes, each with a detection mechanism and an escalation path.
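The pre-mortem rule (at least three failure modes, each with a detection mechanism and an escalation path) can be enforced as a simple completeness check before go-live. The structure below is a sketch. `FailureMode`, `pre_mortem_gaps` and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    detection: str   # how the failure is noticed in production
    escalation: str  # who is informed and what happens next


def pre_mortem_gaps(modes, minimum=3):
    """Return a list of gaps; an empty list means the pre-mortem is complete."""
    gaps = []
    if len(modes) < minimum:
        gaps.append(f"only {len(modes)} failure modes listed, need at least {minimum}")
    for mode in modes:
        if not mode.detection.strip():
            gaps.append(f"no detection mechanism for: {mode.description}")
        if not mode.escalation.strip():
            gaps.append(f"no escalation path for: {mode.description}")
    return gaps


# Hypothetical register, for illustration only.
register = [
    FailureMode("answers cite outdated price list",
                "nightly spot check of 50 answers against the price master",
                "product owner pauses the pricing intent"),
    FailureMode("answers leak data of another customer",
                "automated scan of responses for foreign customer IDs",
                "data-protection officer, system set to read-only"),
    FailureMode("latency exceeds SLA at month end",
                "alert on 95th-percentile response time",
                "on-call engineer, fallback to existing search"),
]
print(pre_mortem_gaps(register))
```

A register with fewer entries or blank fields returns a non-empty list, which can block the go-live decision.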

1.7 No independent review

What usually does not take place: a person outside the PoC team checks the success claim. What does take place: the team that ran the PoC writes the final report, with an incentive toward a positive result.

What you do differently: put in writing, in advance, who performs the independent review. This person is not in the same reporting line as the PoC team and holds a veto over the success report.

2 · What the common cause is

All seven points above share the same replication-crisis background: a methodological error in the preparation that biases the result without anyone on the team noticing, because the team shares the same methodological blind spot.

In academic research this mechanism has been documented since Ioannidis 2005 (Why Most Published Research Findings Are False) and dissected methodologically since Gelman/Loken 2014 as the "Garden of Forking Paths".

The transfer into GenAI pilot practice is direct: a PoC team has structurally the same incentives as an academic to produce a positive result, and the same methodological tools are missing.

→ How myBytes itself works against this: the Truth-Check Protocol

3 · Practical consequence for your Q3-Q4 2026 AI investment

If you face a GenAI pilot decision today, three upfront questions matter more than the tool selection:

What is the pre-specified success endpoint? In writing, before the PoC starts. If the answer is "we will define that during the pilot", the probability of PoC failure is high.
Who is the independent reviewer? Someone outside the PoC team's reporting line, with a veto over the success report.
Which TCO model applies? Three usage scenarios, a twelve-month horizon, all overhead components.

If these three questions cannot be answered within two weeks, you should postpone the PoC - not switch the provider.

4 · The strongest counter-position: "pilots are exploratory, 95% failure is normal"

A plausible counter-position: "Pilots are exploratory risk investments. A 95% failure rate is industry-standard outside GenAI too. A 5% hit rate is structurally enough success."

Three answers:

30-40 billion dollars at a 5% hit rate. Even with industry-standard dispersion, the money flowing into the 95% failed pilots consumes liquidity that could be deployed more productively elsewhere.
Methodologically improvable. The seven causes above are all improvable. A PoC that avoids the methodological errors has an empirically higher success rate.
Reputational risk on repeated failure. Three abandoned PoCs in a row erode the internal mandate for the next AI investment. That is real damage to a career and to the next opportunity.

5 · What we do differently at myBytes

myBytes is small and does not take part in the GenAI hype. We build classical ML and geo pipelines (forecasting, EUDR risk classification, demand planning) that are methodologically checked against the Truth-Check Protocol. Every statement about our models ships with a companion repository on GitHub. On the first run of the notebook, an assert block checks whether the values cited in the article reproduce exactly from the snapshot.

This is not an industry standard, and it is slower than a normal PoC. But it is the only methodology that does not hit the wall after six months because of one of the seven causes above.
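An assert block of the kind described above could look roughly like this. It is a generic sketch, not the code from the myBytes repositories: the snapshot, the cited values and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical frozen data snapshot and the values cited in an article.
SNAPSHOT = [12.0, 15.0, 9.0, 14.0]
CITED = {"n": 4, "mean": 12.5, "max": 15.0}


def recompute(snapshot):
    """Recompute every cited value from the snapshot, rounded as it is cited."""
    return {
        "n": len(snapshot),
        "mean": round(mean(snapshot), 2),
        "max": max(snapshot),
    }


def check_cited_values(cited, snapshot):
    """Fail loudly if any cited value does not reproduce from the snapshot."""
    actual = recompute(snapshot)
    for key, cited_value in cited.items():
        assert actual[key] == cited_value, (
            f"{key}: cited {cited_value}, recomputed {actual[key]}"
        )


check_cited_values(CITED, SNAPSHOT)
print("all cited values reproduce from the snapshot")
```

If a cited number is edited without the data changing, or the other way round, the first notebook run stops with an assertion error instead of silently publishing a mismatch.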

→ The Truth-Check Protocol in depth

6 · What this article does not cover

Concrete tool recommendations. We name no specific LLM providers or MLOps tools, because the tool choice depends on your data architecture and your compliance position.
GenAI-specific data-protection questions. A discussion of GDPR compliance with external LLM APIs follows in a separate article.
Sector-specific ROI examples. We publish no sector benchmark numbers that do not stem from our own implementation practice.

7 · Reading list

MIT NANDA, State of AI in Business 2025 - the study behind the 95% figure.
Fortune, 18 Aug 2025, MIT report 95% of GenAI pilots - the journalistic write-up.
Gartner press release, 29 Jul 2024 - the 30% PoC-abandonment forecast.
Ioannidis 2005, Why Most Published Research Findings Are False - the original methodological paper.
Gelman/Loken 2014, The Garden of Forking Paths - the follow-up paper on the underlying mechanism.

A Truth-Check Protocol for AI research output - the in-depth methodological article this one builds on.