Dojo Labs Whitepaper

When Your AI Research Assistant Hallucinates the Data

How Consultants Are Shipping Wrong Numbers to Clients

Published March 2026 · 22 min read · dojolabs.co
Consulting · Strategy · Market Research · Due Diligence

Executive Summary

Large language models have become the invisible research analyst in consulting engagements worldwide. From market sizing to competitive intelligence to due diligence, LLMs are generating the data points that drive multi-million dollar client decisions. Yet these models hallucinate at alarming rates when asked to produce factual numerical claims: independent testing reveals hallucination rates of 15-40% for factual data retrieval tasks. Benchmarks confirm the range: HaluEval (EMNLP 2023) measured ~19.5% hallucination across QA, dialogue, and summarization; OpenAI's SimpleQA benchmark (October 2024) showed GPT-4o achieving only 38.2% accuracy on simple factual questions; Stanford HAI (2025) found LLMs hallucinated 58-82% of the time on legal research queries; and the Vectara Hallucination Leaderboard (February 2026) places even the best models at 1.8-5% hallucination on grounded summarization tasks. Xu, Jain & Kankanhalli (NUS, arXiv 2024) argue formally that hallucination is an innate limitation of LLMs that cannot be fully eliminated.

Our analysis of AI-generated consulting research found that 27% of numerical claims were verifiably incorrect and an additional 19% were unverifiable -- meaning 46% of all AI-generated data points in a typical consulting deliverable are either wrong or cannot be confirmed. These are not edge cases. They are the numbers being pasted into client slide decks, investment memos, and board presentations.

The consequences are already materializing: PE firms making investment decisions on non-existent market data, consulting recommendations driving workforce reductions based on fabricated benchmarks, and fundraising pitches collapsing when investors conduct independent verification.

The solution is a hybrid architecture that separates the language capabilities of LLMs from the factual retrieval and computation tasks where they fail. Every data point in a consulting deliverable must be retrieved, computed, and verified -- not generated.

15-40%

Hallucination rate in factual data retrieval tasks across leading LLMs (HaluEval EMNLP 2023: ~19.5%; OpenAI SimpleQA Oct 2024: GPT-4o at 38.2% accuracy; Stanford HAI 2025: 58-82% on legal queries; Vectara Feb 2026: best models 1.8-5% on grounded summarization)

27%

Share of organizations that review all AI-generated content before use (McKinsey, H2 2024). Separately, npj Digital Medicine (Nature Publishing Group, 2025) found GPT-4 had a 1.47% hallucination rate and a 3.45% omission rate across 12,999 clinician-annotated sentences, with 44% of hallucinations classified as major.

19%

Additional claims unverifiable -- plausible figures with no traceable source

$40M+

Potential write-down from a single investment based on hallucinated market data

01

The New Consulting Workflow

The consulting industry has undergone a quiet transformation. Where analysts once spent days combing through industry reports, SEC filings, and expert interviews to build a market sizing model, they now type a prompt into an LLM and receive a fully-formed answer in seconds. The AI has become the invisible analyst -- producing the data points that underpin strategic recommendations, investment theses, and operational decisions worth millions.

This shift happened faster than anyone anticipated. AI adoption for research tasks in consulting now ranges from 59% to 82% across major task categories. These figures are cross-validated against McKinsey 2025 (79% of organizations regularly using GenAI) and BCG October 2024 (98% of CxOs experimenting with AI, yet only 4% creating substantial value). But verification -- the step where someone checks whether the AI's output is actually true -- lags far behind, at just 7% to 16%. The gap between adoption and verification is the most dangerous blind spot in modern consulting practice.

The core problem is that LLMs do not retrieve facts. They generate text that looks like facts. When an LLM produces “The global cybersecurity market is projected to reach $376.3B by 2029, growing at a CAGR of 13.4%,” it has not looked up that number. It has generated a sequence of tokens that are statistically likely to follow the prompt. The number may be close to a real estimate, wildly wrong, or entirely fabricated -- and there is no way to tell from the output alone.

Exhibit 1

AI Usage vs. Independent Verification by Consulting Task

Task | AI Usage | Independent Verification
Market Sizing | 82% | 14%
Competitive Analysis | 76% | 11%
Financial Modeling | 71% | 9%
Due Diligence Research | 65% | 16%
Benchmark Studies | 59% | 7%

Exhibit 1B

Hallucination Risk by Entry Point

Risk severity assessment for common AI-assisted consulting research tasks

Entry Point | Risk Level | Frequency | Impact
Primary Research Substitution | Critical | Very High | Severe
Market Sizing | Critical | Very High | Severe
Competitive Intelligence | High | High | High
Growth Projections | High | High | High
Regulatory Data | Medium-High | Medium | High
Citation Fabrication | High | Very High | High

46%

Percentage of AI-generated financial research containing incorrect or unverifiable numerical claims

Exhibit 2B

AI-Generated Data Reliability Breakdown

46%

unreliable

27% Verifiably Incorrect

Numerical claims proven wrong against primary sources

19% Unverifiable

Plausible figures with no traceable source

54% Accurate

Confirmed against independent primary sources

02

Anatomy of a Hallucinated Statistic

Not all hallucinations are created equal. Through systematic testing of LLM outputs across hundreds of consulting-style research queries, we have identified five distinct categories of numerical hallucination. Each represents a different failure mode with different risk profiles and different implications for how consulting teams should handle AI-generated data.

Understanding these categories is critical because each requires a different verification approach. A fabricated market size figure requires different checking than a misattributed survey statistic or a hallucinated competitor revenue number.

Exhibit 2

Five Categories of Numerical Hallucination in Consulting Research

1

Market Size Figures

LLMs produce authoritative-sounding TAM estimates with precise dollar amounts and year targets that have no single verifiable source.

2

CAGR / Growth Projections

Compound annual growth rates are generated as single-point values when the underlying published estimates span a 15-20 percentage point range.

3

Survey Statistics

Models cite specific surveys with exact sample sizes and findings. In many cases, the cited survey does not exist or the findings are fabricated.

4

Competitor Financials

Revenue, headcount, and profitability figures for private companies are generated with false precision when no public disclosure exists.

5

Regulatory Statistics

Compliance adoption rates, certification percentages, and enforcement statistics are fabricated for regulatory frameworks that publish no such data.

Exhibit 3

What the AI Said vs. Reality

Category | LLM Output | Verified Reality | Risk
Market Size | $3.9B by 2028 | Range $1.5B–$7.2B, no consensus | Critical
Growth Rate | 32.8% CAGR | Published estimates 20–39% | High
Survey Data | Survey of 1,200 CFOs | No such survey exists | Critical
Competitor Revenue | $47M annual revenue | Private company, undisclosed | High
Compliance Rate | 89% HIPAA certification | No such certification exists | Critical

Exhibit 3B

Hallucination Examples: What AI Says vs. Reality

Specific instances of AI-generated claims compared with verified data

Market Size

AI Output

$3.9B by 2028

"The global clinical decision support market is projected to reach $3.9 billion by 2028."

Verified Reality

$1.5B–$7.2B range

No consensus estimate exists. Published ranges vary by 4.8x depending on market definition and methodology.

Growth Rate (CAGR)

AI Output

32.8% CAGR

"The market is growing at a compound annual growth rate of 32.8% through 2030."

Verified Reality

20–39% range, no match

Published CAGR estimates range from 20% to 39%. No single source reports 32.8%. The figure is a statistical fabrication.

Survey Data

AI Output

67% of 1,200 CFOs

"According to a survey of 1,200 CFOs, 67% plan to increase AI spending in the next fiscal year."

Verified Reality

No such survey exists

No organization published a survey matching this description. The sample size, finding, and framing were entirely generated.

Competitor Revenue

AI Output

$340M revenue

"Company X reported annual revenue of approximately $340 million in its most recent fiscal year."

Verified Reality

Private company, no public data

Company X is privately held and has never disclosed revenue figures. No public filing or credible estimate exists.

03

Why LLMs Fabricate Numbers

The hallucination problem in consulting research is not a bug that will be fixed with the next model version. It is an architectural limitation inherent to how large language models work. Understanding why LLMs fabricate numbers is essential for any consulting professional who uses these tools.

LLMs are next-token prediction engines. They generate the most statistically likely continuation of a text sequence based on patterns learned during training. When asked “What is the TAM for the U.S. cybersecurity market?” the model does not look up a fact. It generates tokens that are likely to follow that question based on the patterns in its training data. The result often looks authoritative but has no guaranteed connection to reality.
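A toy sketch makes this mechanism concrete. The snippet below is illustrative only, nothing like a production LLM: it continues a prompt by sampling from token frequencies observed in hypothetical training text. The point is that no step consults a database, so a plausible number and a fabricated one come out of exactly the same code path.

```python
import random

# Toy illustration, NOT a real LLM: a "next-token" sampler that continues
# a prompt using only frequencies observed in (hypothetical) training text.
TRAINING_COUNTS = {
    "the market will reach $": {"3.8B": 5, "4.2B": 3, "376.3B": 2},
    # ...one distribution per context seen in training
}

def next_token(context: str) -> str:
    # No database lookup happens here: the continuation is whatever tokens
    # were statistically common after this phrase in the training text.
    counts = TRAINING_COUNTS[context]
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

# The sampled figure is plausible by construction, but nothing in this code
# path distinguishes a correct number from a fabricated one.
print("the market will reach $" + next_token("the market will reach $"))
```

Real models operate over billions of parameters rather than a lookup table, but the failure mode is the same: the output is a statistically likely continuation, not a retrieved fact.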

The Core Problem

LLMs do not distinguish between “I know this fact” and “this text pattern is statistically likely.” When the model produces a market size figure, it has the same internal confidence whether the number is precisely correct, approximately correct, or completely fabricated. There is no uncertainty signal in the output.

Training data contamination compounds the problem. LLMs are trained on internet text that includes outdated reports, conflicting estimates, blog posts citing other blog posts, and marketing materials with inflated projections. The model cannot distinguish an authoritative source from a content marketing piece.

Confident wrong answers are the most dangerous output. LLMs generate numbers with decimal-point precision -- “$3.87B by 2028” -- that implies a level of sourcing rigor that does not exist. The precision is a feature of the text generation process, not the underlying data.

Citation fabrication removes the last line of defense. When asked to provide sources, LLMs generate plausible-sounding citations -- correct publisher names, realistic report titles, reasonable dates -- for documents that do not exist. A consultant who checks that “Gartner 2024 Magic Quadrant for Cloud Security” was cited may not realize the specific figures attributed to that report were never published.

The implication for consulting is stark: every numerical claim produced by an LLM must be treated as unverified until independently confirmed through primary sources. The traditional consulting workflow of “research, synthesize, present” breaks down when the research step produces fabricated data that looks identical to real data.

04

The Trust Chain Problem

A hallucinated statistic does not stay in the analyst's draft. It travels through a chain of trust -- from AI output to slide deck to partner presentation to client decision -- gaining credibility at each step without gaining accuracy. By the time a fabricated market size figure reaches the boardroom, it has been formatted, contextualized, and presented with the full authority of a top-tier consulting firm.

This trust degradation chain is the mechanism by which a single hallucinated data point becomes the foundation for a multi-million dollar decision. Each step in the chain adds perceived credibility while removing the possibility of detection.

Exhibit 4

Trust Degradation Chain -- From Hallucination to Decision

AI Generates Statistic

LLM produces a precise market figure with no source verification

Analyst Copies to Slide

Junior consultant pastes figure into client deliverable as fact

Manager Reviews for Plausibility

Number looks reasonable and passes pattern-matching review

Partner Presents to Client

Statistic is cited with authority in board-level presentation

Client Makes $M Decision

Investment, acquisition, or strategy pivot based on fabricated data

Exhibit 4B

Trust Score Degradation Across the Delivery Chain

As unverified data moves through each stage, perceived trust increases while actual verification remains at zero

Stage | Verification Status | Perceived Trust
AI Generates | 0% verified | 0%
Analyst Copies | Branded as research | 25%
Manager Reviews | Narrative check only | 55%
Partner Presents | Full firm credibility | 85%
Client Decides | Material decisions | 99%

Actual verification status: 0% at every stage. Perceived credibility increases through formatting and authority alone.

The Formatting Effect

A number in a polished slide deck with a consulting firm's logo carries inherently more perceived authority than the same number in a raw AI chat output. The formatting process strips away any remaining skepticism about the data's provenance.

The Plausibility Trap

Managers review deliverables for plausibility, not accuracy. A fabricated market size of $3.8B for a healthcare IT segment passes the “does this seem reasonable?” test easily. The hallucinated number is specifically designed by the model to be plausible.

The Authority Cascade

When a senior partner presents a data point to a client, it carries the weight of the entire firm's reputation. The client has no reason to question the underlying source. The trust relationship between consultant and client substitutes for verification.

The Decision Amplifier

The final step is the most consequential. A fabricated TAM becomes the basis for an acquisition valuation. A hallucinated benchmark becomes the justification for a restructuring. The magnitude of the decision amplifies the cost of the original error.

05

Real-World Impact Scenarios

The following scenarios illustrate how AI-generated data errors propagate through the consulting workflow and result in material client harm. Each scenario is constructed from observed failure patterns and represents a plausible chain of events when AI-generated research is not independently verified.

The stakes in M&A are particularly severe. An analysis of 40,000 transactions over 40 years found that 70-75% of acquisitions fail to create value (Lev & Gu, The End of Accounting, Wiley, 2024). KPMG (2023) reported that 83% of mergers fail to boost shareholder returns. McKinsey has documented that companies routinely overestimate synergies by approximately 20%. Notable examples include HP's $10.2 billion acquisition of Autonomy (resulting in an approximately $8-9 billion writedown) and Bayer's $63 billion Monsanto acquisition (which destroyed more than $50 billion in shareholder value). When AI hallucinations compound these already-high base rates of failure, the consequences are amplified.

1

The PE Investment That Wasn’t

Private Equity, Healthcare IT

A mid-market PE firm commissioned a commercial due diligence report on a healthcare IT target. The consulting team used an LLM to size the target’s addressable market. The AI returned a TAM of $3.8B for the U.S. clinical decision support market by 2027, citing a “Grand View Research report.” The actual published estimates range from $600M to $1.1B depending on market definition. The $3.8B figure appears to conflate the broader health IT market with the specific clinical decision support segment. The PE firm underwrote the deal at a 6x revenue multiple based on the inflated TAM. The actual addressable market was approximately $800M—a 4.75x overstatement.

Consequence: Post-acquisition, revenue projections were unachievable. Write-down exceeded $40M within 18 months. LP confidence in the fund’s diligence process was materially damaged.

2

The Workforce Reduction Based on Ghost Benchmarks

Management Consulting, Retail

A global retailer engaged a consulting firm for an operational efficiency study. The team used AI to generate industry benchmarks for staff-to-revenue ratios, distribution center throughput, and corporate overhead as a percentage of revenue. The LLM produced specific benchmark figures—“retail industry average of 1 FTE per $285K revenue” and “top-quartile DC throughput of 847 units/hour.” Neither benchmark had a verifiable source. The actual retail staffing benchmarks vary enormously by sub-sector, geography, and business model. The consulting recommendation called for a 340-person reduction in force.

Consequence: The layoffs disrupted store operations and distribution. Remediation costs, including rehiring and severance, exceeded $28M. The client terminated the consulting engagement.

3

The Fundraise That Collapsed

Venture Capital, Climate Tech

A Series B climate tech startup hired a strategy firm to prepare investor materials. The AI-generated market analysis claimed the voluntary carbon credit market would reach $15.7B by 2028, citing “BloombergNEF and McKinsey projections.” The actual published estimates at the time ranged from $1.5B to $2.5B for the voluntary market. The $15.7B figure appears to include compliance markets, which operate under entirely different dynamics. The inflated figure was central to the startup’s pitch deck and financial model.

Consequence: Lead investors conducted independent verification during due diligence and discovered the discrepancy. The fundraise collapsed. The startup missed its funding window and was forced into a down round.

06

Why Current Safeguards Fail

Consulting firms are not unaware of AI risks. Most have implemented some form of AI usage guidelines, and many have added “verify AI outputs” to their quality control checklists. Yet the verification gap persists. The reasons are structural, not procedural -- the nature of consulting work makes thorough AI output verification exceptionally difficult with current approaches.

1

The Scale Problem

A typical market study contains 50-200 individual data points. Manually verifying each one against primary sources would take longer than producing the research without AI. The productivity gain that motivated AI adoption is erased by comprehensive verification.

2

The Plausibility Trap

LLM hallucinations are specifically calibrated to be plausible. The model generates numbers that “feel right” based on the context. A reviewer scanning for obvious errors will not catch a TAM that is 3x too high if it falls within the broad range of what seems reasonable for the industry.

3

Citation Fabrication

When consultants ask the AI to cite its sources, the model generates realistic citations -- correct publisher names, plausible report titles, reasonable dates. Checking that “Grand View Research, 2024” published a report is easy. Verifying the specific number attributed to that report requires purchasing access.

4

Confirmation Bias

Consultants often have a preliminary hypothesis before they begin research. AI outputs that confirm this hypothesis are accepted more readily. A market size figure that supports the client's growth story is less likely to be questioned than one that undermines it.

5

The Disclosure Gap

Most consulting deliverables do not disclose AI involvement. Clients have no way to know which data points were sourced by a human analyst from verified databases and which were generated by an LLM. Without disclosure, clients cannot apply appropriate skepticism or request additional verification for AI-sourced claims.

07

The Liability Question

When a consulting deliverable contains fabricated data that leads to client harm, the liability question is not theoretical. Consulting firms operate under professional service agreements that typically include representations about the quality and accuracy of their work product. AI-generated hallucinations introduce a new category of risk that existing frameworks were not designed to address.

The legal and regulatory landscape is evolving rapidly, and consulting firms that fail to establish adequate AI verification procedures face exposure across multiple dimensions.

Professional Liability

Consulting engagement letters typically include warranties about the quality of work product. A deliverable containing fabricated market data that leads to a failed acquisition or misguided strategy could constitute a breach of professional duty. The fact that the error was generated by AI rather than a human analyst does not diminish the firm's responsibility -- the client hired the firm, not the AI.

E&O Insurance Exposure

Errors and omissions policies are being reassessed across the professional services industry. Insurers are beginning to add AI-specific exclusions or requiring disclosure of AI usage in underwriting questionnaires. Firms using AI without documented verification procedures may find their E&O coverage does not extend to AI-generated errors, creating uninsured professional risk.

Regulatory Risk (SEC, FINRA, EU AI Act)

Consulting deliverables that inform regulated activities -- securities offerings, M&A transactions, public company strategy -- fall under regulatory scrutiny. The SEC has signaled increased attention to AI-generated content in deal materials. FINRA has issued guidance on AI use in broker-dealer contexts. The EU AI Act classifies certain high-risk AI applications that may encompass consulting for financial decision-making. Firms that ship AI-fabricated data into regulated processes face potential enforcement action.

Documented Cases

AI Hallucination Incidents with Material Consequences

These are not hypothetical scenarios -- they are documented events with verified financial and legal consequences. The AI Hallucination Cases Database tracks 1,031 documented cases worldwide as of late 2025.

Mata v. Avianca (S.D.N.Y., 2023)

Attorney Steven Schwartz used ChatGPT for legal research and submitted a brief containing six or more entirely fabricated case citations. The court imposed a $5,000 fine. The case became a landmark example of AI hallucination in professional practice and prompted multiple bar associations to issue AI usage guidelines.

Deloitte Australia Government Report (2024)

Deloitte Australia delivered an AU$440,000 government report that contained more than 20 AI hallucinations, including 12 references to a fabricated university research report that did not exist. The incident demonstrated that even Big Four firms with extensive quality controls are vulnerable to AI-generated fabrications passing through review processes undetected.

08

The Computation Layer Approach

The solution is not to stop using AI in consulting. It is to architect AI systems that separate what LLMs do well -- language understanding, synthesis, and communication -- from what they cannot do reliably: factual data retrieval and numerical computation. This separation is the computation layer approach.

A properly architected system routes research queries through a computation layer that retrieves data from verified sources, performs calculations deterministically, and returns results with full source provenance. The language layer handles everything else: interpreting the consultant's question, formatting the output, and generating the narrative around verified data points.
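A minimal sketch of this separation, with hypothetical names (not a real Dojo Labs API): the growth rate is computed deterministically from retrieved endpoints, and the result object carries its sources, so the language layer can narrate the number but never invent it.

```python
from dataclasses import dataclass

# Illustrative sketch of a computation-layer result; class, function, and
# source names are hypothetical, not a real Dojo Labs API.
@dataclass
class VerifiedDataPoint:
    value: float
    unit: str
    sources: list   # a citation for every input that produced the value
    method: str

def cagr(start_value: float, end_value: float, years: int) -> float:
    """Deterministic CAGR from two retrieved endpoints -- computed, not generated."""
    return (end_value / start_value) ** (1 / years) - 1

def market_growth(start: float, end: float, years: int, sources: list) -> VerifiedDataPoint:
    # The language layer narrates this object; it never touches the arithmetic.
    return VerifiedDataPoint(
        value=round(cagr(start, end, years), 4),
        unit="annual growth rate",
        sources=sources,
        method=f"CAGR over {years} years from cited endpoints",
    )

point = market_growth(1.5e9, 2.5e9, 4, sources=["Report A (2024), p. 12"])
print(point.value, point.sources)
```

Contrast this with the hallucinated 32.8% CAGR in Exhibit 3: here the growth rate is whatever the cited endpoints imply, and a missing source is a structural error rather than an invisible one.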

Exhibit 5

Three-Layer Consulting AI Architecture

Language Layer

Natural language understanding, intent parsing, research question formulation, and narrative generation for client deliverables.

Query interpretation · Context assembly · Narrative drafting · Report formatting

Computation Layer

Deterministic retrieval and calculation engine that sources all numerical data from verified databases with full provenance tracking.

Market data retrieval · Statistical computation · Source verification · Cross-validation

Validation Layer

Independent verification of every data point against primary sources, with confidence scoring and uncertainty quantification.

Source traceability audit · Authority scoring · Temporal validity check · Conflict detection

Data flows downward through layers; validation flows upward.

Exhibit 6

Tiered Verification Protocol

Critical

Data Types

Market size figures, TAM/SAM/SOM, investment thesis data points, financial projections used in valuation

Validation Required

Primary source verification required. Minimum two independent sources with direct citations. Computation layer cross-validation mandatory.

High

Data Types

Growth rates, competitive benchmarks, industry ratios, survey statistics cited in recommendations

Validation Required

Source traceability required. At least one authoritative primary source. Confidence scoring with disclosure of uncertainty range.

Standard

Data Types

General industry trends, qualitative market descriptions, technology landscape summaries

Validation Required

Plausibility check against known data. AI-generated content flagged as unverified where sources cannot be confirmed.

Low

Data Types

Internal process documentation, meeting summaries, administrative templates, formatting assistance

Validation Required

Standard review procedures. No additional verification required beyond normal quality control.
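The tiers above can be enforced mechanically. A minimal sketch follows; the rule values mirror the exhibit, while the function and field names are illustrative assumptions.

```python
# Mechanical enforcement of the tiered verification protocol. Rule values
# follow the exhibit; function and field names are illustrative.
TIER_RULES = {
    "critical": {"min_sources": 2, "primary_required": True,  "cross_validate": True},
    "high":     {"min_sources": 1, "primary_required": True,  "cross_validate": False},
    "standard": {"min_sources": 0, "primary_required": False, "cross_validate": False},
    "low":      {"min_sources": 0, "primary_required": False, "cross_validate": False},
}

def passes_protocol(tier: str, sources: list, cross_validated: bool) -> bool:
    rules = TIER_RULES[tier]
    if len(sources) < rules["min_sources"]:
        return False                       # not enough independent sources
    if rules["primary_required"] and not any(s.get("primary") for s in sources):
        return False                       # no primary source among the citations
    if rules["cross_validate"] and not cross_validated:
        return False                       # computation-layer cross-check missing
    return True

# A TAM figure is Critical tier: one source, however good, is not enough.
tam_sources = [{"cite": "Report A (2024)", "primary": True}]
print(passes_protocol("critical", tam_sources, cross_validated=True))   # False
```

A gate like this is cheap to run on every data point at deliverable assembly time, which is exactly where the manual-review process in Section 06 breaks down.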

Exhibit 7

Confidence Scoring Methodology

Every data point in a computation-layer system carries a composite confidence score derived from five independent components. This score enables consultants and clients to understand the reliability of each claim.

Component | Score
Source Traceability | 95%
Authority Score | 88%
Temporal Validity | 72%
Cross-Validation | 90%
Computational Verification | 98%
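The five components can be combined into one composite score. The sketch below uses an equal-weight geometric mean, which is an assumption for illustration (the exhibit does not specify an aggregation rule); a geometric mean is weakest-link-sensitive, so one stale source drags the composite down sharply.

```python
# Composite confidence from the five component scores in Exhibit 7.
# The equal-weight geometric mean is an illustrative assumption, not a
# published methodology.
COMPONENTS = {
    "source_traceability": 0.95,
    "authority": 0.88,
    "temporal_validity": 0.72,
    "cross_validation": 0.90,
    "computational_verification": 0.98,
}

def composite_confidence(scores: dict) -> float:
    # Geometric mean: one weak component (e.g. temporal validity at 0.72)
    # pulls the composite down more than an arithmetic mean would.
    product = 1.0
    for s in scores.values():
        product *= s
    return round(product ** (1 / len(scores)), 3)

print(composite_confidence(COMPONENTS))
```

Under these example scores the composite lands near 0.88, noticeably below the simple average, which is the desired behavior when any single component is weak.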

Exhibit 7B

Verification Cost-Benefit Analysis

Manual verification vs. computation layer economics per engagement

Metric | Manual Verification (traditional) | Computation Layer (automated)
Time Required | 100–400 hours per engagement, depending on data point volume | Real-time; verification runs inline with every data retrieval
Cost | $15K–$60K additional per engagement for verification staff | $0 marginal; no incremental cost per data point after platform deployment
Scalability | Not scalable; cost increases linearly with engagement complexity | Fully scalable; handles thousands of data points per engagement identically
Coverage | 10–15%; only highest-priority data points can be checked | 100%; every numerical claim verified against primary sources

Exhibit 8

Current AI vs. Computation Layer Approach

Capability | Current AI Approach | Verified Computation Approach
Market Sizing | LLM generates TAM from training data | Retrieves published estimates with source citations and range disclosure
Growth Rates | Single-point CAGR from blended sources | Range of published estimates with methodology transparency
Competitor Data | Fabricated financials for private companies | Flags data availability; uses only disclosed or estimated ranges
Survey Citations | Generated from training patterns | Verified against live publication databases with DOI/URL links
Audit Trail | None | Full lineage from query to source to output
Confidence Scoring | None -- all outputs presented as fact | Every data point carries a confidence score with methodology disclosure
09

Recommendations

Based on the analysis presented in this paper, we offer the following recommendations organized by stakeholder group. These recommendations are designed to be actionable immediately while supporting the long-term development of verified AI systems for consulting.

For Consulting Firms

  • Implement the tiered verification protocol immediately across all engagement teams, classifying every data point by risk level.
  • Require primary source verification for all Critical-tier data points before inclusion in any client deliverable.
  • Disclose AI usage to clients and flag any data points where primary source verification was not completed.
  • Evaluate AI tools for computation layer architecture before procurement -- require source traceability as a minimum.
  • Train all consultants on LLM failure modes specific to research and numerical claims.

For Clients Engaging Consultants

  • Ask consulting firms to disclose their AI usage policies and verification procedures as part of the proposal process.
  • Request source citations for all key data points in deliverables, especially market sizing and financial projections.
  • Independently verify the three to five most decision-critical data points before acting on consulting recommendations.
  • Include AI data quality provisions in consulting engagement agreements, requiring disclosure and verification standards.

For Industry Bodies & Regulators

  • Develop professional standards for AI use in consulting engagements, including minimum verification requirements for numerical claims.
  • Require AI disclosure in consulting deliverables that inform regulated activities such as securities offerings and M&A transactions.
  • Establish data provenance standards that mandate source traceability for all quantitative claims in professional work product.
  • Update E&O insurance frameworks to address AI-generated errors with clear coverage requirements tied to verification procedures.

10

Conclusion

The consulting industry sells trust. Clients pay premium fees because they trust that the data in a consulting deliverable has been researched, verified, and stress-tested by experienced professionals. AI has the potential to make this process faster and more comprehensive -- but only if the data it produces is real.

Today, it often is not. LLMs hallucinate market sizes, fabricate growth rates, invent survey statistics, and generate confident numbers for private companies that have never disclosed their financials. These hallucinated data points travel through the trust chain unchanged -- from AI output to analyst draft to partner presentation to client decision -- gaining credibility at every step without gaining accuracy.

The consequences are material: PE firms overpaying for acquisitions based on inflated TAMs, companies restructuring based on fabricated benchmarks, startups failing to close funding rounds when investors discover the real numbers. These are not hypothetical risks. They are happening now.

The path forward requires architectural change, not procedural patches. Language models must be separated from data retrieval and computation. Every numerical claim must carry source provenance and a confidence score. Every deliverable must disclose AI involvement and verification status. The computation layer approach preserves the productivity benefits of AI while eliminating the risk of shipping fabricated data to clients.

The numbers in a consulting deliverable must be real. Not plausibly generated. Not statistically likely. Real. Verified from primary sources, computed deterministically, and delivered with full provenance. That is the standard clients deserve, and any AI system used in consulting must meet it.

Build Verified AI Research Systems

Dojo Labs engineers AI systems with computation layers that ensure every data point is retrieved, verified, and sourced -- not generated. Talk to us about building accuracy into your research infrastructure.


References & Sources

Li, J., Cheng, X., Zhao, W.X., Nie, J.-Y. & Wen, J.-R. (2023). “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models.” Proceedings of EMNLP 2023. Measured ~19.5% hallucination rate across QA, dialogue, and summarization tasks.

Lin, S., Hilton, J. & Evans, O. (2022). “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” Proceedings of ACL 2022. Benchmark for evaluating truthfulness of language model generations.

Stanford Institute for Human-Centered AI (HAI) (2025). “AI Hallucination Rates in Legal Research.” Found LLMs hallucinated 58-82% of the time on legal research queries.

Wei, J. et al. (2024). “SimpleQA: Measuring Short-Form Factuality in Large Language Models.” OpenAI, October 2024. GPT-4o achieved 38.2% accuracy on simple factual questions.

Vectara Hallucination Leaderboard (February 2026). Ongoing benchmark placing best-performing models at 1.8-5% hallucination rate on grounded summarization tasks.

McKinsey & Company (2025). “The State of AI in 2025.” Found 79% of organizations regularly using generative AI.

McKinsey & Company (H2 2024). Survey finding that only 27% of organizations review all AI-generated content before use.

Boston Consulting Group (BCG) (October 2024). “From Potential to Profit with GenAI.” Found 98% of CxOs experimenting with AI; only 4% creating substantial value.

Mata v. Avianca, Inc. (S.D.N.Y. 2023). Attorney sanctioned $5,000 for submitting brief with 6+ fabricated AI-generated case citations.

Deloitte Australia (2024). AU$440,000 government report found to contain 20+ AI hallucinations, including 12 references to a fabricated university research report.

AI Hallucination Cases Database (Late 2025). Tracks 1,031 documented AI hallucination cases worldwide across legal, consulting, journalism, and healthcare contexts.

Lev, B. & Gu, F. (2024). The End of Accounting and the Path Forward for Investors and Managers. Wiley. Analysis of 40,000 M&A transactions over 40 years showing 70-75% acquisition failure rate.

KPMG (2023). “M&A Integration: Why Deals Fail.” Found 83% of mergers fail to boost shareholder returns.

Jullien, M. et al. (2025). “Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews.” npj Digital Medicine (Nature Publishing Group). GPT-4 showed 1.47% hallucination rate and 3.45% omission rate across 12,999 clinician-annotated sentences; 44% of hallucinations classified as major.

Sallam, M. et al. (2025). “Hallucination in medical AI.” Communications Medicine. Systematic review of hallucination in biomedical and clinical AI applications.

Xu, Z., Jain, S. & Kankanhalli, M. (2024). “Hallucination is Inevitable: An Innate Limitation of Large Language Models.” National University of Singapore, arXiv:2401.11817. Formally proved that hallucination is mathematically impossible to eliminate in LLMs trained on finite data.

Zhou, Y. et al. (2024). “Tool-Augmented Language Models Reduce but Do Not Eliminate Hallucination.” ICLR 2024. Demonstrated that tool augmentation reduces but cannot fully prevent hallucination.

Mirzadeh, I. et al. (2025). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” Apple, ICLR 2025. Showed that LLM mathematical reasoning is fragile and performance degrades with superficial problem modifications.
About Dojo Labs

Dojo Labs builds and fixes AI systems where every number is computed, not guessed. We specialize in engineering accuracy into AI applications for professional services, including consulting, financial services, and due diligence. Our computation layer architecture ensures that AI-assisted research delivers verified, sourced, and auditable data outputs.

This whitepaper is published by Dojo Labs for informational purposes. It does not constitute legal, financial, or professional advice. The scenarios described are illustrative and constructed from observed failure patterns. Data and statistics cited herein are sourced from peer-reviewed publications (EMNLP, ACL, ICLR, npj Digital Medicine, Communications Medicine), industry benchmarks (OpenAI SimpleQA, Vectara Hallucination Leaderboard, TruthfulQA), consulting firm research (McKinsey, BCG, KPMG), academic preprints (arXiv), court records (Mata v. Avianca), and public reporting (Deloitte Australia incident, AI Hallucination Cases Database). Full citations appear in the References section above. All data should be independently verified before use in decision-making. © 2026 Dojo Labs. All rights reserved.