Skip to main content

56% of Organizations Get Nothing from AI Because Compatibility Was an Afterthought

The consulting industry has spent two years producing reports on why AI investments fail. The data is consistent and substantial. What the reports lack is a theoretical frame that explains the pattern.

The numbers arriving from every major consulting firm are converging on the same uncomfortable finding. PwC's 29th Global CEO Survey, polling over 4,400 executives worldwide, found that 56% of organizations reported neither increased revenue nor reduced costs from their AI deployments over the prior twelve months (PwC, 2026). CEO confidence in revenue growth hit a five-year low, even as AI investment continued to rise (Reuters, 2026). MIT researchers documented a 95% failure rate for Generative AI pilots across enterprise settings (Fortune, 2025). Deloitte's dedicated analysis of the AI ROI question confirmed the pattern: investment is increasing while measurable returns remain elusive for the majority, with genuine P&L impact concentrated in a small minority of organizations (Deloitte, 2025). McKinsey's State of AI survey reinforced that the performance gap between leading and lagging firms is widening, and that the differentiating factor is organizational execution rather than model selection (McKinsey, 2025). The World Economic Forum's synthesis of CEO perspectives heading into 2026 described organizations caught between the pressure to invest and the inability to demonstrate returns (WEF, 2026).

Only 12% of organizations, a group PwC labels "Vanguard" firms, were achieving simultaneous revenue gains and cost reductions (PwC, 2026). The rest are, in one form or another, paying for a technology that is not returning the investment.

The explanations offered across these reports are varied. Fragmented data architectures, where organizations attempt to run advanced reasoning models on inconsistent, siloed records. Isolated departmental pilots that fail to connect with core business systems and generate no enterprise-scale return. Insufficient leadership literacy. Work intensification, documented in an eight-month ethnographic study by Ye and Ranganathan at UC Berkeley Haas, where AI tools were expanding task scopes, dissolving natural pauses in the workday, and introducing parallelism that increased cognitive load rather than reducing it. Each of these explanations captures something real.

What none of them adequately explains is the mechanism. Why do technically capable tools, deployed in well-resourced organizations with genuine executive commitment, produce so little measurable return? My empirical research in software engineering offers a precise answer: compatibility. Specifically, the degree to which a Generative AI tool fits within the existing workflows, technical environment, and working practices of the people expected to use it. When that fit is absent, neither the tool's capability nor the organization's commitment can substitute for it.

What the Reports Are Actually Describing

Before turning to the empirical evidence, it is worth dwelling on what the consulting data is showing in concrete terms, because the ROI crisis is not uniform. PwC's analysis draws a sharp distinction between what it calls the "Vanguard" minority and the majority. The Vanguard firms share three characteristics: enterprise-scale deployment rather than isolated pilots, clean and integrated data environments that give AI tools access to the information they need to be useful, and cross-functional collaboration that prevents AI initiatives from being captured by a single department (PwC, 2026). The PwC analysts are essentially describing compatibility conditions without using the term.

Deloitte's framing of the "paradox" is instructive in a similar way. Investment is rising, in some sectors doubling, while EBITDA impact remains elusive. The natural interpretation is that organizations are buying more of something that is not working. But the Deloitte analysis points toward a subtler problem: organizations are deploying tools that generate activity without generating value, because the tools have not been integrated into the workflows where value actually originates (Deloitte, 2025). McKinsey's data confirms that high performers are distinguished by their ability to scale AI beyond "pockets of innovation" into systematic workflow integration (McKinsey, 2025).

The MIT pilot failure data is the sharpest expression of this pattern. A 95% failure rate is not a technology problem. It is a deployment problem. The pilots are failing not because the models produce inadequate outputs in controlled conditions, but because those outputs cannot be absorbed into the work as it is actually structured. The tool and the workflow are not compatible, and no amount of prompting strategy resolves a structural integration gap.

The Empirical Evidence: Compatibility Drives Adoption, Not Usefulness

My research published in ACM Transactions on Software Engineering and Methodology was not designed to comment on the ROI crisis directly. I was studying the factors that drive software engineers to adopt Generative AI tools. But the findings map directly onto the pattern the consulting reports are describing, and they offer a precision the reports lack (Russo, 2024).

I conducted a mixed-methods investigation: a qualitative study with 100 software engineers followed by quantitative validation using Partial Least Squares Structural Equation Modeling on data from 183 practitioners drawn from 27 countries. The theoretical framework I developed, the Human-AI Collaboration and Adaptation Framework (HACAF), integrated the Technology Acceptance Model, Diffusion of Innovation Theory, and Social Cognitive Theory. The central question was: what actually drives a software engineer to adopt an LLM tool?

The conventional answer, grounded in the Technology Acceptance Model, is perceived usefulness. If you believe the tool helps you, you use it. Most organizational AI strategies rest on this assumption. My data contradicted it. The direct path from technology perceptions to adoption intention was not statistically significant (path coefficient 0.155, p = 0.204). Perceived usefulness, ease of use, and relative advantage, taken together, did not predict whether engineers would adopt these tools.

What did predict adoption, with substantial and statistically robust explanatory power, was compatibility.

Compatibility Factors are the most important element to improve the Intention to Use. They explain, almost single-handedly, the adoption of AI tools with a very high R² close to 50%.

Russo (2024), ACM Transactions on Software Engineering and Methodology

The path coefficient from compatibility to adoption intention was 0.536, with a large effect size (f² = 0.66). The full model explained 48.9% of the variance in adoption intent. The Importance-Performance Map Analysis confirmed that compatibility carries the highest importance score of any construct in the model (total effect: 0.646). Social factors (peer influence and self-efficacy) showed no significant direct path to adoption (p = 0.170). Organizational support and personal innovativeness were also non-significant (p = 0.949).

The theoretical implication is clear: an engineer who believes a tool is useful, who works in a supportive organization, and who is surrounded by colleagues who encourage adoption will still not meaningfully adopt that tool if it does not fit the way they work. This is the mechanism the consulting reports are observing without naming.

What Compatibility Means in Practice

In the qualitative phase of my research, I asked 100 software engineers directly about the compatibility of LLMs with their current work. Four aggregate dimensions emerged.

The most frequently cited, by 39% of respondents, was improved efficiency: the sense that the tool speeds up existing tasks rather than requiring engineers to rebuild their working habits around a new paradigm. A further 28% described the value of AI assistance in situations where traditional search methods fell short, particularly for code discovery and documentation. Sixteen percent cited similarity to current practices, characterizing LLM use as an incremental extension of familiar habits. One participant put it simply: "It's like Googling but I don't need to filter as much information." Eleven percent identified adaptation and learning as the dominant dimension, acknowledging that the tool represents a genuinely new paradigm that requires time and investment before it feels natural.

The complexity data is equally instructive. Among the barriers engineers cited, one participant noted: "Might not be compatible with the systems at my workplace." Another observed that LLMs are "nice for generic tasks, but the models have zero knowledge about our internal APIs so they're really hard to apply." These are not complaints about model quality. They are descriptions of integration gaps that no prompting strategy can bridge.

This is the same gap PwC identifies when it distinguishes Vanguard firms by their integrated data environments. The firms that are generating returns have made the infrastructure investments that allow AI tools to access the information that makes them genuinely useful for specific work. The majority have not, and the tools remain generically capable in contexts where specific capability is what produces value.

I also found that compatibility failures compound with experience heterogeneity. Fifty-four percent of respondents in my study described ease of use as something that develops through practice, while 26% identified individual background, including prior programming experience and familiarity with NLP concepts, as significant determinants of perceived fit. A deployment that treats all developers as equally prepared will systematically underperform, because the compatibility gap between the tool and the person varies substantially by career stage and cognitive style.

Three Places Where Compatibility Gets Lost

The consulting reports and my research converge on three distinct levels where the fit between AI tools and organizational reality tends to break down.

System fit refers to integration with the existing technical stack. PwC's Vanguard firms are characterized precisely by having solved this problem: clean data environments, API access, and tooling that allows AI to connect to the information it needs to be useful for actual tasks. The 95% pilot failure rate documented by MIT reflects, in significant part, the inverse condition: technically capable tools deployed into environments where they cannot access the context that would make them valuable. Generic capability deployed against context-specific work produces generic output, which developers quickly learn to discard.

Flow fit addresses whether the tool sustains or disrupts a developer's cognitive state. The UC Berkeley Haas research by Ye and Ranganathan identified three forms of work intensification: expanded scope, dissolved pauses, and parallelism. Each represents a form of flow disruption. Deloitte's analysis of the ROI paradox is consistent with this: organizations reporting the highest AI investment levels are not reporting proportionally higher returns, because the human cost of adaptation is consuming the efficiency gains before they reach the P&L (Deloitte, 2025). A tool that demands significant prompting overhead to produce usable output, or one whose output review cycle consumes more time than the generation saved, has negative flow fit for that task type.

People fit captures the match between the individual engineer and the tool. In my research, I found that 57% of respondents placed significant importance on being seen as someone who uses cutting-edge technology, a form of self-efficacy pressure that can drive superficial adoption without genuine workflow integration. Meanwhile, 35% prioritized practicality and efficiency over novelty, indicating they would adopt a tool only when it demonstrably served their specific tasks. McKinsey's data shows that high-performing organizations differentiate deployment approaches by role and experience level; they do not apply a uniform adoption mandate and expect uniform results (McKinsey, 2025).

The Compatibility Audit: A 30-Minute Drill for Your Team

Before authorizing the next AI investment or expanding a current deployment, a structured compatibility assessment can prevent the pilot failure pattern from repeating. Run through the following as a team in a single session.

  1. Map the workflow first. Document the five to seven recurring tasks your team performs most often. For each, identify the current toolchain, the expected output format, and the review steps involved.

  2. Test system fit against actual tasks. Can the AI tool access the data it needs for your specific work, including internal APIs, proprietary libraries, or domain-specific documentation? If the answer requires significant workarounds, record the integration cost before measuring any productivity gain.

  3. Run a flow disruption count. Have three developers use the tool during a standard sprint. Ask each to note any moment where the tool created friction, required a context switch, or produced output that needed substantial correction. Count these events across the sprint and compare against the time saved.

  4. Assess experience heterogeneity. Identify the range of seniority and prior AI experience on your team. Has the deployment plan differentiated the onboarding approach for junior developers versus those with a decade of experience? My data indicates this differentiation is not optional: it is predictive of adoption outcome.

  5. Audit the learning curve assumption. My research found that 54% of software engineers describe LLM ease of use as something that develops through practice. Is there allocated time for this in the sprint cadence, or is immediate productivity being assumed?

  6. Check organizational stance clarity. In my study, 42% of respondents worked in organizations with a clearly supportive attitude toward LLMs, while 15% reported active opposition and 20% reported a neutral stance that left the decision to individual discretion. Ambiguity in organizational stance depresses adoption regardless of tool quality.

  7. Evaluate the output review cost. The time saved on generation must exceed the time spent on verification for a net compatibility gain. Test this for each task type rather than assuming the balance holds uniformly.

  8. Score similarity to current practices. For each task type, ask developers whether using the tool feels like an extension of existing habits or a replacement of them. Tools that extend generate organic adoption. Tools that replace generate sustained resistance.

Next Moves

For The Builder

Maintain a compatibility log for the first four weeks of any new AI tool deployment. Record which task types produce clean, usable outputs on the first pass, which require significant prompting iteration, and which produce outputs that are structurally incompatible with your downstream review process. This data is the basis for an evidence-based adoption recommendation, and it is far more valuable than any benchmark score the vendor has provided.

When a tool consistently fails on tasks involving your internal systems or proprietary context, raise the integration gap with your team lead. Do not compensate by constructing elaborate prompting strategies for a tool that lacks access to the information it needs. My research makes clear that this compatibility failure cannot be engineered around at the individual level.

Track your own learning curve trajectory. If you are not observing a meaningful reduction in prompting friction after four weeks of consistent use, this is diagnostic information about tool-workflow alignment, not a reflection of your ability to adopt new technology.

For The Manager

Stop using adoption rates as a proxy for AI value. A team that has uniformly adopted a tool producing consistent flow disruption is not outperforming a team that has selectively adopted a tool for the specific tasks where the fit is strong. Measure time-to-first-usable-output and review-iteration counts per task type, and track these over time.

Differentiate deployment expectations by developer experience level. My findings show that individual background substantially shapes how quickly an engineer can work effectively with LLMs. Pairing junior developers with more experienced practitioners during the initial adoption period, specifically to share workflow-compatible prompting strategies, reduces the learning curve and mitigates the risk of the complacency dynamic I observed in 16% of respondents, who worried that junior engineers were losing foundational skills by over-relying on AI output.

Create structured space for the team to surface compatibility failures. If developers are routinely encountering mismatches between AI output format and downstream review requirements, this needs to reach engineering leadership before it becomes an invisible drag on velocity that no sprint metric captures.

For The Roadmap Owner

Reframe your AI evaluation criteria to lead with compatibility assessment rather than model benchmarks. PwC's Vanguard firms share integrated data environments and systematic deployment, not superior model selection (PwC, 2026). My research explains why: nearly 50% of adoption intention variance is explained by compatibility factors alone, with technology perceptions showing no direct path to adoption. Procurement that selects on benchmark scores is optimizing for a variable that does not drive adoption, and without adoption, there is no return.

Before authorizing full deployment, commission an assessment against the three compatibility dimensions (system fit, flow fit, and people fit) for the specific teams and task mixes involved. The 95% pilot failure rate is not primarily a consequence of model underperformance. It is a consequence of skipping this assessment.

Allocate budget explicitly for integration infrastructure. API access layers, internal documentation indexing, and output format alignment are investments that appear nowhere in most AI vendor proposals and prominently in the practices of the firms generating real returns. These investments directly improve compatibility factors, which my data identifies as the primary driver of adoption intent.

A Question Worth Taking Back

The consulting reports have accurately mapped the scale of the AI ROI crisis. What they have been slower to provide is a mechanism. My research suggests the mechanism is compatibility: the structural fit between a tool and the workflow, stack, and people where it is deployed. When that fit is absent, capability does not produce adoption, and adoption does not produce return.

The organizations generating real returns have not found a better model. They have achieved a better fit.

Daniel Russo, Ph.D., is a Professor of Software Engineering whose research examines the intersection of human cognition and artificial intelligence. Through "Software Insights," he translates empirical research into actionable guidance for software practitioners and organizations.

Partner with Daniel to transform your organization through evidence-based approaches that bridge academic rigor with practical implementation. His consulting work helps organizations to adopt scientifically validated practices that improve software development outcomes, team performance, and innovation capacity.

Learn more about his approach to evidence-based organizational change: https://www.danielrusso.org/evidence-based-organizational-change/ (Opens in a new window)

References

Deloitte. (2025). AI ROI: The paradox of rising investment and elusive returns. Deloitte Insights.

Fortune. (2025). MIT report: 95% of generative AI pilots at companies are failing. Fortune.

McKinsey & Company. (2025). The state of AI: Global survey 2025. McKinsey Quarterly.

PwC. (2026). 29th Global CEO Survey: Leading through uncertainty in the age of AI. PricewaterhouseCoopers.

PwC. (2026). CEOs, to get real value from AI, put the right foundations in place. The Leadership Agenda.

Reuters. (2026). CEO revenue confidence hits 5-year low: PwC survey. Reuters.

Russo, D. (2024). Navigating the complexity of Generative AI adoption in software engineering. ACM Transactions on Software Engineering and Methodology, 33(5), Article 135.

World Economic Forum. (2026). Reinvention in a fragmenting world: What CEOs are saying (and need to know) in 2026. World Economic Forum.

Ye, X. M., & Ranganathan, A. (2025). Work intensification and the vicious cycle of AI-assisted productivity. UC Berkeley Haas School of Business.