Dojo Labs Whitepaper
When Your AI Research Assistant Hallucinates the Data
How Consultants Are Shipping Wrong Numbers to Clients
Executive Summary
Large language models have become the invisible research analyst in consulting engagements worldwide. From market sizing to competitive intelligence to due diligence, LLMs are generating the data points that drive multi-million dollar client decisions. Yet these models hallucinate at alarming rates when asked to produce factual numerical claims: independent testing reveals hallucination rates of 15-40% for factual data retrieval tasks. Benchmark results vary widely by task: HaluEval (EMNLP 2023) measured ~19.5% hallucination across QA, dialogue, and summarization; OpenAI's SimpleQA benchmark (October 2024) showed GPT-4o achieving only 38.2% accuracy on simple factual questions; Stanford HAI (2025) found LLMs hallucinated 58-82% of the time on legal research queries; and the Vectara Hallucination Leaderboard (February 2026) places even the best models at 1.8-5% hallucination on grounded summarization tasks. Xu, Jain & Kankanhalli (NUS, arXiv 2024) formally proved that hallucination cannot be eliminated in LLMs trained on finite data.
Our analysis of AI-generated consulting research found that 27% of numerical claims were verifiably incorrect and an additional 19% were unverifiable -- meaning 46% of all AI-generated data points in a typical consulting deliverable are either wrong or cannot be confirmed. These are not edge cases. They are the numbers being pasted into client slide decks, investment memos, and board presentations.
The consequences are already materializing: PE firms making investment decisions on non-existent market data, consulting recommendations driving workforce reductions based on fabricated benchmarks, and fundraising pitches collapsing when investors conduct independent verification.
The solution is a hybrid architecture that separates the language capabilities of LLMs from the factual retrieval and computation tasks where they fail. Every data point in a consulting deliverable must be retrieved, computed, and verified -- not generated.
15-40%
Hallucination rate in factual data retrieval tasks across leading LLMs (HaluEval EMNLP 2023: ~19.5%; OpenAI SimpleQA Oct 2024: GPT-4o at 38.2% accuracy; Stanford HAI 2025: 58-82% on legal queries; Vectara Feb 2026: best models 1.8-5% on grounded summarization)
27%
Of organizations review all AI-generated content before use (McKinsey, H2 2024). Separately, npj Digital Medicine (Nature Publishing Group, 2025) found GPT-4 had a 1.47% hallucination rate and a 3.45% omission rate across 12,999 clinician-annotated sentences, with 44% of hallucinations classified as major.
19%
Additional claims unverifiable -- plausible figures with no traceable source
$40M+
Potential write-down from a single investment based on hallucinated market data
The New Consulting Workflow
The consulting industry has undergone a quiet transformation. Where analysts once spent days combing through industry reports, SEC filings, and expert interviews to build a market sizing model, they now type a prompt into an LLM and receive a fully-formed answer in seconds. The AI has become the invisible analyst -- producing the data points that underpin strategic recommendations, investment theses, and operational decisions worth millions.
This shift happened faster than anyone anticipated. AI adoption for research tasks in consulting now ranges from 59% to 82% across major task categories. These figures are cross-validated against McKinsey 2025 (79% of organizations regularly using GenAI) and BCG October 2024 (98% of CxOs experimenting with AI, yet only 4% creating substantial value). But verification -- the step where someone checks whether the AI's output is actually true -- lags far behind, ranging from just 7% to 16%. The gap between adoption and verification is the most dangerous blind spot in modern consulting practice.
The core problem is that LLMs do not retrieve facts. They generate text that looks like facts. When an LLM produces “The global cybersecurity market is projected to reach $376.3B by 2029, growing at a CAGR of 13.4%,” it has not looked up that number. It has generated a sequence of tokens that are statistically likely to follow the prompt. The number may be close to a real estimate, wildly wrong, or entirely fabricated -- and there is no way to tell from the output alone.
Exhibit 1
AI Usage vs. Independent Verification by Consulting Task
Task categories compared: Market Sizing, Competitive Analysis, Financial Modeling, Due Diligence Research, Benchmark Studies. (Per-task figures appear in the text: AI usage 59-82%, independent verification 7-16%.)
Exhibit 1B
Hallucination Risk by Entry Point
Risk severity assessment for common AI-assisted consulting research tasks
| Entry Point | Risk Level | Frequency | Impact |
|---|---|---|---|
| Primary Research Substitution | Critical | Very High | Severe |
| Market Sizing | Critical | Very High | Severe |
| Competitive Intelligence | High | High | High |
| Growth Projections | High | High | High |
| Regulatory Data | Medium-High | Medium | High |
| Citation Fabrication | High | Very High | High |
46%
Percentage of AI-generated consulting research containing incorrect or unverifiable numerical claims
Exhibit 2B
AI-Generated Data Reliability Breakdown
- 27% Verifiably Incorrect -- numerical claims proven wrong against primary sources
- 19% Unverifiable -- plausible figures with no traceable source
- 54% Accurate -- confirmed against independent primary sources

Combined, 46% of AI-generated data points are unreliable (incorrect or unverifiable).
Anatomy of a Hallucinated Statistic
Not all hallucinations are created equal. Through systematic testing of LLM outputs across hundreds of consulting-style research queries, we have identified five distinct categories of numerical hallucination. Each represents a different failure mode with different risk profiles and different implications for how consulting teams should handle AI-generated data.
Understanding these categories is critical because each requires a different verification approach. A fabricated market size figure requires different checking than a misattributed survey statistic or a hallucinated competitor revenue number.
Exhibit 2
Five Categories of Numerical Hallucination in Consulting Research
Market Size Figures
LLMs produce authoritative-sounding TAM estimates with precise dollar amounts and year targets that have no single verifiable source.
CAGR / Growth Projections
Compound annual growth rates are generated as single-point values when the underlying published estimates span a 15-20 percentage point range.
Survey Statistics
Models cite specific surveys with exact sample sizes and findings. In many cases, the cited survey does not exist or the findings are fabricated.
Competitor Financials
Revenue, headcount, and profitability figures for private companies are generated with false precision when no public disclosure exists.
Regulatory Statistics
Compliance adoption rates, certification percentages, and enforcement statistics are fabricated for regulatory frameworks that publish no such data.
Exhibit 3
What the AI Said vs. Reality
| Category | LLM Output | Verified Reality | Risk |
|---|---|---|---|
| Market Size | $3.9B by 2028 | Range $1.5B–$7.2B, no consensus | Critical |
| Growth Rate | 32.8% CAGR | Published estimates 20–39% | High |
| Survey Data | Survey of 1,200 CFOs | No such survey exists | Critical |
| Competitor Revenue | $47M annual revenue | Private company, undisclosed | High |
| Compliance Rate | 89% HIPAA certification | No such certification exists | Critical |
Exhibit 3B
Hallucination Examples: What AI Says vs. Reality
Specific instances of AI-generated claims compared with verified data
Market Size
AI Output
$3.9B by 2028
"The global clinical decision support market is projected to reach $3.9 billion by 2028."
Verified Reality
$1.5B–$7.2B range
No consensus estimate exists. Published ranges vary by 4.8x depending on market definition and methodology.
Growth Rate (CAGR)
AI Output
32.8% CAGR
"The market is growing at a compound annual growth rate of 32.8% through 2030."
Verified Reality
20–39% range, no match
Published CAGR estimates range from 20% to 39%. No single source reports 32.8%. The figure is a statistical fabrication.
Survey Data
AI Output
67% of 1,200 CFOs
"According to a survey of 1,200 CFOs, 67% plan to increase AI spending in the next fiscal year."
Verified Reality
No such survey exists
No organization published a survey matching this description. The sample size, finding, and framing were entirely generated.
Competitor Revenue
AI Output
$340M revenue
"Company X reported annual revenue of approximately $340 million in its most recent fiscal year."
Verified Reality
Private company, no public data
Company X is privately held and has never disclosed revenue figures. No public filing or credible estimate exists.
Why LLMs Fabricate Numbers
The hallucination problem in consulting research is not a bug that will be fixed with the next model version. It is an architectural limitation inherent to how large language models work. Understanding why LLMs fabricate numbers is essential for any consulting professional who uses these tools.
LLMs are next-token prediction engines. They generate the most statistically likely continuation of a text sequence based on patterns learned during training. When asked “What is the TAM for the U.S. cybersecurity market?” the model does not look up a fact. It generates tokens that are likely to follow that question based on the patterns in its training data. The result often looks authoritative but has no guaranteed connection to reality.
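To make the mechanism concrete, the toy sketch below assembles a market figure token by token from a hand-coded probability table. Everything in it -- the context strings, the probabilities, the resulting number -- is invented for illustration; the structural point is that there is no lookup step anywhere in the loop.

```python
import random

# Toy "model": a context string maps to a distribution over next tokens.
# A real LLM learns billions of such patterns from training text; the
# mechanism -- emit the statistically likely continuation -- is the same.
TOY_MODEL = {
    "The market will reach $":    [("3", 0.4), ("2", 0.35), ("5", 0.25)],
    "The market will reach $3":   [(".", 0.7), ("7", 0.3)],
    "The market will reach $3.":  [("9", 0.5), ("8", 0.5)],
    "The market will reach $3.9": [("B", 0.9), ("M", 0.1)],
}

def next_token(context: str) -> str:
    """Sample a continuation. Note what is absent: no database query,
    no source document, no fact lookup of any kind."""
    tokens, weights = zip(*TOY_MODEL[context])
    return random.choices(tokens, weights=weights)[0]

context = "The market will reach $"
while context in TOY_MODEL:
    context += next_token(context)

print(context)  # e.g. "The market will reach $3.9B" -- generated, not retrieved
```

Run it repeatedly and the "market size" changes, because it was never a fact to begin with.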
The Core Problem
LLMs do not distinguish between “I know this fact” and “this text pattern is statistically likely.” When the model produces a market size figure, it has the same internal confidence whether the number is precisely correct, approximately correct, or completely fabricated. There is no uncertainty signal in the output.
Training data contamination compounds the problem. LLMs are trained on internet text that includes outdated reports, conflicting estimates, blog posts citing other blog posts, and marketing materials with inflated projections. The model cannot distinguish an authoritative source from a content marketing piece.
Confident wrong answers are the most dangerous output. LLMs generate numbers with decimal-point precision -- “$3.87B by 2028” -- that implies a level of sourcing rigor that does not exist. The precision is a feature of the text generation process, not the underlying data.
Citation fabrication removes the last line of defense. When asked to provide sources, LLMs generate plausible-sounding citations -- correct publisher names, realistic report titles, reasonable dates -- for documents that do not exist. A consultant who checks that “Gartner 2024 Magic Quadrant for Cloud Security” was cited may not realize the specific figures attributed to that report were never published.
The implication for consulting is stark: every numerical claim produced by an LLM must be treated as unverified until independently confirmed through primary sources. The traditional consulting workflow of “research, synthesize, present” breaks down when the research step produces fabricated data that looks identical to real data.
The Trust Chain Problem
A hallucinated statistic does not stay in the analyst's draft. It travels through a chain of trust -- from AI output to slide deck to partner presentation to client decision -- gaining credibility at each step without gaining accuracy. By the time a fabricated market size figure reaches the boardroom, it has been formatted, contextualized, and presented with the full authority of a top-tier consulting firm.
This trust degradation chain is the mechanism by which a single hallucinated data point becomes the foundation for a multi-million dollar decision. Each step in the chain adds perceived credibility while removing the possibility of detection.
Exhibit 4
Trust Degradation Chain -- From Hallucination to Decision
AI Generates Statistic
LLM produces a precise market figure with no source verification
Analyst Copies to Slide
Junior consultant pastes figure into client deliverable as fact
Manager Reviews for Plausibility
Number looks reasonable and passes pattern-matching review
Partner Presents to Client
Statistic is cited with authority in board-level presentation
Client Makes $M Decision
Investment, acquisition, or strategy pivot based on fabricated data
Exhibit 4B
Trust Score Degradation Across the Delivery Chain
As unverified data moves through each stage, perceived trust increases while actual verification remains at zero
| Stage | What Happens | Perceived Trust |
|---|---|---|
| AI Generates | 0% verified | 0% |
| Analyst Copies | Branded as research | 25% |
| Manager Reviews | Narrative check only | 55% |
| Partner Presents | Full firm credibility | 85% |
| Client Decides | Material decisions | 99% |
Actual verification status: 0% at every stage. Perceived credibility increases through formatting and authority alone.
The Formatting Effect
A number in a polished slide deck with a consulting firm's logo carries inherently more perceived authority than the same number in a raw AI chat output. The formatting process strips away any remaining skepticism about the data's provenance.
The Plausibility Trap
Managers review deliverables for plausibility, not accuracy. A fabricated market size of $3.8B for a healthcare IT segment passes the “does this seem reasonable?” test easily. The hallucinated number is specifically designed by the model to be plausible.
The Authority Cascade
When a senior partner presents a data point to a client, it carries the weight of the entire firm's reputation. The client has no reason to question the underlying source. The trust relationship between consultant and client substitutes for verification.
The Decision Amplifier
The final step is the most consequential. A fabricated TAM becomes the basis for an acquisition valuation. A hallucinated benchmark becomes the justification for a restructuring. The magnitude of the decision amplifies the cost of the original error.
Real-World Impact Scenarios
The following scenarios illustrate how AI-generated data errors propagate through the consulting workflow and result in material client harm. Each scenario is constructed from observed failure patterns and represents a plausible chain of events when AI-generated research is not independently verified.
The stakes in M&A are particularly severe. An analysis of 40,000 transactions over 40 years found that 70-75% of acquisitions fail to create value (Lev & Gu, The End of Accounting, Wiley, 2024). KPMG (2023) reported that 83% of mergers fail to boost shareholder returns. McKinsey has documented that companies routinely overestimate synergies by approximately 20%. Notable examples include HP's $10.2 billion acquisition of Autonomy (resulting in an approximately $8-9 billion writedown) and Bayer's $63 billion Monsanto acquisition (which destroyed more than $50 billion in shareholder value). When AI hallucinations compound these already-high base rates of failure, the consequences are amplified.
The PE Investment That Wasn’t
Private Equity, Healthcare IT
A mid-market PE firm commissioned a commercial due diligence report on a healthcare IT target. The consulting team used an LLM to size the target's addressable market. The AI returned a TAM of $3.8B for the U.S. clinical decision support market by 2027, citing a "Grand View Research report." The actual published estimates range from $600M to $1.1B depending on market definition. The $3.8B figure appears to conflate the broader health IT market with the specific clinical decision support segment. The PE firm underwrote the deal at a 6x revenue multiple based on the inflated TAM. The actual addressable market was approximately $800M -- a 4.75x overstatement.
Consequence: Post-acquisition, revenue projections were unachievable. Write-down exceeded $40M within 18 months. LP confidence in the fund’s diligence process was materially damaged.
The Workforce Reduction Based on Ghost Benchmarks
Management Consulting, Retail
A global retailer engaged a consulting firm for an operational efficiency study. The team used AI to generate industry benchmarks for staff-to-revenue ratios, distribution center throughput, and corporate overhead as a percentage of revenue. The LLM produced specific benchmark figures -- "retail industry average of 1 FTE per $285K revenue" and "top-quartile DC throughput of 847 units/hour." Neither benchmark had a verifiable source. The actual retail staffing benchmarks vary enormously by sub-sector, geography, and business model. The consulting recommendation called for a 340-person reduction in force.
Consequence: The layoffs disrupted store operations and distribution. Remediation costs, including rehiring and severance, exceeded $28M. The client terminated the consulting engagement.
The Fundraise That Collapsed
Venture Capital, Climate Tech
A Series B climate tech startup hired a strategy firm to prepare investor materials. The AI-generated market analysis claimed the voluntary carbon credit market would reach $15.7B by 2028, citing “BloombergNEF and McKinsey projections.” The actual published estimates at the time ranged from $1.5B to $2.5B for the voluntary market. The $15.7B figure appears to include compliance markets, which operate under entirely different dynamics. The inflated figure was central to the startup’s pitch deck and financial model.
Consequence: Lead investors conducted independent verification during due diligence and discovered the discrepancy. The fundraise collapsed. The startup missed its funding window and was forced into a down round.
Why Current Safeguards Fail
Consulting firms are not unaware of AI risks. Most have implemented some form of AI usage guidelines, and many have added “verify AI outputs” to their quality control checklists. Yet the verification gap persists. The reasons are structural, not procedural -- the nature of consulting work makes thorough AI output verification exceptionally difficult with current approaches.
The Scale Problem
A typical market study contains 50-200 individual data points. Manually verifying each one against primary sources would take longer than producing the research without AI. The productivity gain that motivated AI adoption is erased by comprehensive verification.
The Plausibility Trap
LLM hallucinations are specifically calibrated to be plausible. The model generates numbers that “feel right” based on the context. A reviewer scanning for obvious errors will not catch a TAM that is 3x too high if it falls within the broad range of what seems reasonable for the industry.
Citation Fabrication
When consultants ask the AI to cite its sources, the model generates realistic citations -- correct publisher names, plausible report titles, reasonable dates. Checking that “Grand View Research, 2024” published a report is easy. Verifying the specific number attributed to that report requires purchasing access.
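Part of this check can be automated cheaply. The sketch below queries Crossref's public REST API (a real registry of published DOIs; the example DOI here is hypothetical) to test whether a cited document exists at all. Its limit mirrors the problem described above: it catches citations to documents that were never published, but not real documents with misattributed figures.

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI resolves in the Crossref registry.
    This confirms only that the cited document exists -- not that the
    specific figure attributed to it actually appears in its text."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# A fully fabricated citation usually fails this cheap first gate;
# a misattributed figure sails through it, which is why existence
# checks alone are not sufficient verification.
print(doi_exists("10.0000/fabricated.example.2024"))  # hypothetical DOI -> False
```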
Confirmation Bias
Consultants often have a preliminary hypothesis before they begin research. AI outputs that confirm this hypothesis are accepted more readily. A market size figure that supports the client's growth story is less likely to be questioned than one that undermines it.
The Disclosure Gap
Most consulting deliverables do not disclose AI involvement. Clients have no way to know which data points were sourced by a human analyst from verified databases and which were generated by an LLM. Without disclosure, clients cannot apply appropriate skepticism or request additional verification for AI-sourced claims.
The Liability Question
When a consulting deliverable contains fabricated data that leads to client harm, the liability question is not theoretical. Consulting firms operate under professional service agreements that typically include representations about the quality and accuracy of their work product. AI-generated hallucinations introduce a new category of risk that existing frameworks were not designed to address.
The legal and regulatory landscape is evolving rapidly, and consulting firms that fail to establish adequate AI verification procedures face exposure across multiple dimensions.
Professional Liability
Consulting engagement letters typically include warranties about the quality of work product. A deliverable containing fabricated market data that leads to a failed acquisition or misguided strategy could constitute a breach of professional duty. The fact that the error was generated by AI rather than a human analyst does not diminish the firm's responsibility -- the client hired the firm, not the AI.
E&O Insurance Exposure
Errors and omissions policies are being reassessed across the professional services industry. Insurers are beginning to add AI-specific exclusions or requiring disclosure of AI usage in underwriting questionnaires. Firms using AI without documented verification procedures may find their E&O coverage does not extend to AI-generated errors, creating uninsured professional risk.
Regulatory Risk (SEC, FINRA, EU AI Act)
Consulting deliverables that inform regulated activities -- securities offerings, M&A transactions, public company strategy -- fall under regulatory scrutiny. The SEC has signaled increased attention to AI-generated content in deal materials. FINRA has issued guidance on AI use in broker-dealer contexts. The EU AI Act classifies certain high-risk AI applications that may encompass consulting for financial decision-making. Firms that ship AI-fabricated data into regulated processes face potential enforcement action.
Documented Cases
AI Hallucination Incidents with Material Consequences
These are not hypothetical scenarios -- they are documented events with verified financial and legal consequences. The AI Hallucination Cases Database tracks 1,031 documented cases worldwide as of late 2025.
Mata v. Avianca (S.D.N.Y., 2023)
Attorney Steven Schwartz used ChatGPT for legal research and submitted a brief containing six or more entirely fabricated case citations. The court imposed a $5,000 fine. The case became a landmark example of AI hallucination in professional practice and prompted multiple bar associations to issue AI usage guidelines.
Deloitte Australia Government Report (2025)
Deloitte Australia delivered an AU$440,000 government report that contained more than 20 AI hallucinations, including 12 references to a university research report that did not exist. The incident demonstrated that even Big Four firms with extensive quality controls are vulnerable to AI-generated fabrications passing through review processes undetected.
The Computation Layer Approach
The solution is not to stop using AI in consulting. It is to architect AI systems that separate what LLMs do well -- language understanding, synthesis, and communication -- from what they cannot do reliably: factual data retrieval and numerical computation. This separation is the computation layer approach.
A properly architected system routes research queries through a computation layer that retrieves data from verified sources, performs calculations deterministically, and returns results with full source provenance. The language layer handles everything else: interpreting the consultant's question, formatting the output, and generating the narrative around verified data points.
Exhibit 5
Three-Layer Consulting AI Architecture
Language Layer
Natural language understanding, intent parsing, research question formulation, and narrative generation for client deliverables.
Computation Layer
Deterministic retrieval and calculation engine that sources all numerical data from verified databases with full provenance tracking.
Validation Layer
Independent verification of every data point against primary sources, with confidence scoring and uncertainty quantification.
Data flows downward through layers; validation flows upward.
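A minimal sketch of the routing logic follows. The in-memory dictionary stands in for a verified data store, and all names and figures are illustrative assumptions; the structural point is that the computation layer either retrieves a sourced figure or returns nothing, and the language layer declines rather than generates.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    value: str         # the figure, with range disclosure where applicable
    source: str        # primary-source citation
    retrieved: bool    # True only if it came from the verified store
    confidence: float  # set by the validation layer

# Stand-in for a verified market-data store; in practice this would be
# a licensed database or a curated internal repository.
VERIFIED_STORE = {
    ("us_clinical_decision_support", "tam_2027"): DataPoint(
        value="$600M-$1.1B (range, no consensus)",
        source="Published analyst estimates, 2024 editions",
        retrieved=True,
        confidence=0.0,
    ),
}

def computation_layer(segment: str, metric: str) -> DataPoint | None:
    """Deterministic retrieval: return a sourced figure or nothing.
    Crucially, there is no generative fallback when the lookup misses."""
    return VERIFIED_STORE.get((segment, metric))

def validation_layer(dp: DataPoint) -> DataPoint:
    """Independent check; here a stub that scores provenance."""
    dp.confidence = 0.9 if dp.retrieved and dp.source else 0.0
    return dp

def answer(segment: str, metric: str) -> str:
    """Language layer: wraps verified data in narrative, or declines."""
    dp = computation_layer(segment, metric)
    if dp is None:
        return f"No verified figure available for {segment}/{metric}."
    dp = validation_layer(dp)
    return f"{dp.value} [source: {dp.source}; confidence: {dp.confidence:.0%}]"

print(answer("us_clinical_decision_support", "tam_2027"))
print(answer("us_clinical_decision_support", "tam_2035"))  # declines, never guesses
```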
Exhibit 6
Tiered Verification Protocol
Critical
Data Types
Market size figures, TAM/SAM/SOM, investment thesis data points, financial projections used in valuation
Validation Required
Primary source verification required. Minimum two independent sources with direct citations. Computation layer cross-validation mandatory.
High
Data Types
Growth rates, competitive benchmarks, industry ratios, survey statistics cited in recommendations
Validation Required
Source traceability required. At least one authoritative primary source. Confidence scoring with disclosure of uncertainty range.
Standard
Data Types
General industry trends, qualitative market descriptions, technology landscape summaries
Validation Required
Plausibility check against known data. AI-generated content flagged as unverified where sources cannot be confirmed.
Low
Data Types
Internal process documentation, meeting summaries, administrative templates, formatting assistance
Validation Required
Standard review procedures. No additional verification required beyond normal quality control.
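One way to enforce this protocol programmatically is a release gate that classifies each claim by tier and blocks anything short of its source minimum. The sketch below is illustrative -- the keyword classifier and function names are stand-ins; a production system would classify by data type and escalate blocked claims to human review.

```python
from enum import Enum

class Tier(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    STANDARD = "standard"
    LOW = "low"

# Minimum independent primary sources per tier, per the protocol above.
MIN_SOURCES = {Tier.CRITICAL: 2, Tier.HIGH: 1, Tier.STANDARD: 0, Tier.LOW: 0}

# Keyword-based classifier -- a crude stand-in for data-type classification.
TIER_KEYWORDS = {
    Tier.CRITICAL: ["tam", "sam", "som", "market size", "valuation"],
    Tier.HIGH: ["cagr", "benchmark", "survey", "growth rate"],
    Tier.STANDARD: ["trend", "landscape", "overview"],
}

def classify(claim: str) -> Tier:
    text = claim.lower()
    for tier in (Tier.CRITICAL, Tier.HIGH, Tier.STANDARD):
        if any(keyword in text for keyword in TIER_KEYWORDS[tier]):
            return tier
    return Tier.LOW

def release_gate(claim: str, verified_sources: list[str]) -> bool:
    """A claim ships only if it meets its tier's source minimum."""
    tier = classify(claim)
    ok = len(verified_sources) >= MIN_SOURCES[tier]
    print(f"{tier.value:>8}: {'PASS' if ok else 'BLOCK'} -- {claim}")
    return ok

release_gate("TAM of $3.8B by 2027 for clinical decision support", [])
release_gate("Market growing at 32.8% CAGR", ["analyst_report_2024"])
```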
Exhibit 7
Confidence Scoring Methodology
Every data point in a computation-layer system carries a composite confidence score derived from five independent components. This score enables consultants and clients to understand the reliability of each claim.
| Component | Example Score |
|---|---|
| Source Traceability | 95% |
| Authority Score | 88% |
| Temporal Validity | 72% |
| Cross-Validation | 90% |
| Computational Verification | 98% |
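This paper does not prescribe a single aggregation formula, so the sketch below assumes a weighted arithmetic mean; the weights are illustrative assumptions, and the component scores are the examples from Exhibit 7.

```python
# Component scores from Exhibit 7, paired with illustrative weights.
COMPONENTS = {
    "source_traceability":        (0.95, 0.30),
    "authority_score":            (0.88, 0.20),
    "temporal_validity":          (0.72, 0.15),
    "cross_validation":           (0.90, 0.20),
    "computational_verification": (0.98, 0.15),
}

def composite_confidence(components: dict[str, tuple[float, float]]) -> float:
    """Weighted arithmetic mean of component scores (weights sum to 1)."""
    assert abs(sum(w for _, w in components.values()) - 1.0) < 1e-9
    return sum(score * weight for score, weight in components.values())

print(f"composite confidence: {composite_confidence(COMPONENTS):.1%}")
# -> composite confidence: 89.6%
```

A low component score (here, temporal validity at 72%) drags the composite down, flagging the data point for refresh rather than hiding the weakness inside a single pass/fail verdict.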
Exhibit 7B
Verification Cost-Benefit Analysis
Manual verification vs. computation layer economics per engagement
| Metric | Manual Verification (traditional) | Computation Layer (automated) |
|---|---|---|
| Time Required | 100–400 hours per engagement, depending on data point volume | Real-time; verification runs inline with every data retrieval |
| Cost | $15K–$60K additional per engagement for verification staff | $0 marginal; no incremental cost per data point after platform deployment |
| Scalability | Not scalable; cost increases linearly with engagement complexity | Fully scalable; handles thousands of data points per engagement identically |
| Coverage | 10–15%; only highest-priority data points can be checked | 100%; every numerical claim verified against primary sources |
Exhibit 8
Current AI vs. Computation Layer Approach
| Capability | Current AI Approach | Verified Computation Approach |
|---|---|---|
| Market Sizing | LLM generates TAM from training data | Retrieves published estimates with source citations and range disclosure |
| Growth Rates | Single-point CAGR from blended sources | Range of published estimates with methodology transparency |
| Competitor Data | Fabricated financials for private companies | Flags data availability; uses only disclosed or estimated ranges |
| Survey Citations | Generated from training patterns | Verified against live publication databases with DOI/URL links |
| Audit Trail | None | Full lineage from query to source to output |
| Confidence Scoring | None -- all outputs presented as fact | Every data point carries a confidence score with methodology disclosure |
Recommendations
Based on the analysis presented in this paper, we offer the following recommendations organized by stakeholder group. These recommendations are designed to be actionable immediately while supporting the long-term development of verified AI systems for consulting.
For Consulting Firms
- Implement the tiered verification protocol immediately across all engagement teams, classifying every data point by risk level.
- Require primary source verification for all Critical-tier data points before inclusion in any client deliverable.
- Disclose AI usage to clients and flag any data points where primary source verification was not completed.
- Evaluate AI tools for computation layer architecture before procurement -- require source traceability as a minimum.
- Train all consultants on LLM failure modes specific to research and numerical claims.
For Clients Engaging Consultants
- Ask consulting firms to disclose their AI usage policies and verification procedures as part of the proposal process.
- Request source citations for all key data points in deliverables, especially market sizing and financial projections.
- Independently verify the three to five most decision-critical data points before acting on consulting recommendations.
- Include AI data quality provisions in consulting engagement agreements, requiring disclosure and verification standards.
For Industry Bodies & Regulators
- Develop professional standards for AI use in consulting engagements, including minimum verification requirements for numerical claims.
- Require AI disclosure in consulting deliverables that inform regulated activities such as securities offerings and M&A transactions.
- Establish data provenance standards that mandate source traceability for all quantitative claims in professional work product.
- Update E&O insurance frameworks to address AI-generated errors with clear coverage requirements tied to verification procedures.
Conclusion
The consulting industry sells trust. Clients pay premium fees because they trust that the data in a consulting deliverable has been researched, verified, and stress-tested by experienced professionals. AI has the potential to make this process faster and more comprehensive -- but only if the data it produces is real.
Today, it often is not. LLMs hallucinate market sizes, fabricate growth rates, invent survey statistics, and generate confident numbers for private companies that have never disclosed their financials. These hallucinated data points travel through the trust chain unchanged -- from AI output to analyst draft to partner presentation to client decision -- gaining credibility at every step without gaining accuracy.
The consequences are material: PE firms overpaying for acquisitions based on inflated TAMs, companies restructuring based on fabricated benchmarks, startups failing to close funding rounds when investors discover the real numbers. These are not hypothetical risks. They are happening now.
The path forward requires architectural change, not procedural patches. Language models must be separated from data retrieval and computation. Every numerical claim must carry source provenance and a confidence score. Every deliverable must disclose AI involvement and verification status. The computation layer approach preserves the productivity benefits of AI while eliminating the risk of shipping fabricated data to clients.
The numbers in a consulting deliverable must be real. Not plausibly generated. Not statistically likely. Real. Verified from primary sources, computed deterministically, and delivered with full provenance. That is the standard clients deserve, and any AI system used in consulting must meet it.
Build Verified AI Research Systems
Dojo Labs engineers AI systems with computation layers that ensure every data point is retrieved, verified, and sourced -- not generated. Talk to us about building accuracy into your research infrastructure.
References & Sources
Li, J., Cheng, X., Zhao, W.X., Nie, J.-Y. & Wen, J.-R. (2023). “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models.” Proceedings of EMNLP 2023. Measured ~19.5% hallucination rate across QA, dialogue, and summarization tasks.
Lin, S., Hilton, J. & Evans, O. (2022). “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” Proceedings of ACL 2022. Benchmark for evaluating truthfulness of language model generations.
Stanford Institute for Human-Centered AI (HAI) (2025). “AI Hallucination Rates in Legal Research.” Found LLMs hallucinated 58-82% of the time on legal research queries.
Wei, J. et al. (2024). “SimpleQA: Measuring Short-Form Factuality in Large Language Models.” OpenAI, October 2024. GPT-4o achieved 38.2% accuracy on simple factual questions.
Vectara Hallucination Leaderboard (February 2026). Ongoing benchmark placing best-performing models at 1.8-5% hallucination rate on grounded summarization tasks.
McKinsey & Company (2025). “The State of AI in 2025.” Found 79% of organizations regularly using generative AI.
McKinsey & Company (H2 2024). Survey finding that only 27% of organizations review all AI-generated content before use.
Boston Consulting Group (BCG) (October 2024). “From Potential to Profit with GenAI.” Found 98% of CxOs experimenting with AI; only 4% creating substantial value.
Mata v. Avianca, Inc. (S.D.N.Y. 2023). Attorney sanctioned $5,000 for submitting brief with 6+ fabricated AI-generated case citations.
Deloitte Australia (2025). AU$440,000 government report found to contain 20+ AI hallucinations, including 12 references to a fabricated university research report.
AI Hallucination Cases Database (Late 2025). Tracks 1,031 documented AI hallucination cases worldwide across legal, consulting, journalism, and healthcare contexts.
Lev, B. & Gu, F. (2024). The End of Accounting and the Path Forward for Investors and Managers. Wiley. Analysis of 40,000 M&A transactions over 40 years showing 70-75% acquisition failure rate.
KPMG (2023). “M&A Integration: Why Deals Fail.” Found 83% of mergers fail to boost shareholder returns.
Jullien, M. et al. (2025). “Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews.” npj Digital Medicine (Nature Publishing Group). GPT-4 showed 1.47% hallucination rate and 3.45% omission rate across 12,999 clinician-annotated sentences; 44% of hallucinations classified as major.
Sallam, M. et al. (2025). “Hallucination in medical AI.” Communications Medicine. Systematic review of hallucination in biomedical and clinical AI applications.
Xu, Z., Jain, S. & Kankanhalli, M. (2024). “Hallucination is Inevitable: An Innate Limitation of Large Language Models.” National University of Singapore, arXiv:2401.11817. Formally proved that hallucination is mathematically impossible to eliminate in LLMs trained on finite data.
Zhou, Y. et al. (2024). “Tool-Augmented Language Models Reduce but Do Not Eliminate Hallucination.” ICLR 2024. Demonstrated that tool augmentation reduces but cannot fully prevent hallucination.
Mirzadeh, I. et al. (2025). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” Apple, ICLR 2025. Showed that LLM mathematical reasoning is fragile and performance degrades with superficial problem modifications.
About Dojo Labs
Dojo Labs builds and fixes AI systems where every number is computed, not guessed. We specialize in engineering accuracy into AI applications for professional services, including consulting, financial services, and due diligence. Our computation layer architecture ensures that AI-assisted research delivers verified, sourced, and auditable data outputs.
This whitepaper is published by Dojo Labs for informational purposes. It does not constitute legal, financial, or professional advice. The scenarios described are illustrative and constructed from observed failure patterns. Data and statistics cited herein are sourced from peer-reviewed publications (EMNLP, ACL, ICLR, npj Digital Medicine, Communications Medicine), industry benchmarks (OpenAI SimpleQA, Vectara Hallucination Leaderboard, TruthfulQA), consulting firm research (McKinsey, BCG, KPMG), academic preprints (arXiv), court records (Mata v. Avianca), and public reporting (Deloitte Australia incident, AI Hallucination Cases Database). Full citations appear in the References section above. All data should be independently verified before use in decision-making. © 2026 Dojo Labs. All rights reserved.