Dojo Labs Whitepaper
The AI Audit Report That Audits Itself Wrong
How AI Is Fabricating the Numbers in a Profession Built on Getting Them Right
Executive Summary
Large language models (LLMs) are being adopted across the audit profession at unprecedented speed. Firms are using AI to draft workpapers, calculate materiality, perform analytical procedures, and generate audit reports. Yet the foundational limitation of these systems remains unaddressed: LLMs do not compute -- they predict.
This paper documents how AI-generated audit outputs contain fabricated numbers, phantom regulatory citations, and invented variance explanations that are indistinguishable from legitimate work product. In a profession where numerical accuracy is not optional, the consequences are severe: missed misstatements, incorrect audit opinions, PCAOB enforcement actions, and erosion of public trust in financial reporting.
We present a taxonomy of eight distinct AI failure modes observed in audit applications, provide illustrative scenarios demonstrating real-world impact, and propose a three-layer architecture that separates language processing from deterministic computation to achieve audit-grade accuracy.
The central thesis is straightforward: any AI system used in audit must compute its numerical outputs, not generate them. Anything less is professional negligence.
- 79% -- Firms Regularly Using GenAI
- 78 pts -- Largest Verification Gap
- 8 -- Distinct Failure Categories
- $2.4M -- Potential Materiality Error Impact
The Audit Profession Under Pressure
The audit profession stands at an inflection point. Staffing shortages, fee pressure, and accelerating regulatory complexity are driving firms of all sizes to adopt artificial intelligence. The promise is compelling: AI can draft workpapers in minutes rather than hours, generate analytical procedure documentation instantaneously, and produce first-draft audit reports while the engagement team focuses on judgment-intensive tasks.
But this adoption is outpacing verification. McKinsey's 2025 Global Survey (n=1,993) found that 79% of organizations regularly use generative AI, yet only 27% review all AI-generated content before use (McKinsey H2 2024, n=1,491). In audit specifically, our analysis indicates AI adoption rates for numerical tasks range from 59% to 84%, while independent verification remains critically low at 6% to 19%. The AICPA Q4 2024 survey (n=273 CPA decision-makers) confirmed that only 6% of firms are fully using generative AI in key operations, with 92% expressing concern about accuracy risks.
The core problem is architectural: LLMs are language models. They predict the next token in a sequence. When asked to calculate materiality, they do not perform arithmetic -- they generate text that looks like arithmetic. The distinction is invisible in the output but catastrophic in consequence.
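The distinction between generating and computing a number can be made concrete. A deterministic materiality routine produces the same traceable result on every run -- a property no token-predicting model can guarantee. The sketch below is illustrative only: the benchmark-selection rule and percentages are simplified assumptions for exposition, not any firm's actual methodology.

```python
# Illustrative sketch: materiality as a deterministic computation.
# The benchmark-selection rule and rates are simplified assumptions,
# not authoritative firm methodology.

def planning_materiality(revenue: float, pretax_income: float,
                         earnings_volatile: bool) -> dict:
    """Compute planning materiality from audited figures.

    If earnings are volatile or near breakeven, revenue is the benchmark
    (at 0.5%); otherwise pre-tax income is the benchmark (at 5%).
    """
    if earnings_volatile or abs(pretax_income) < 0.02 * revenue:
        benchmark, rate = "revenue", 0.005
        amount = rate * revenue
    else:
        benchmark, rate = "pre-tax income", 0.05
        amount = rate * pretax_income
    return {
        "benchmark": benchmark,
        "rate": rate,
        "materiality": round(amount, -3),            # nearest $1K
        "performance_materiality": round(0.70 * amount, -3),
        "sad_threshold": round(0.05 * amount, -3),   # summary of audit differences
    }

# $820M revenue, volatile earnings -> revenue benchmark, $4.1M materiality
result = planning_materiality(820_000_000, 9_000_000, earnings_volatile=True)
```

The same inputs always yield the same output, every intermediate value is inspectable, and the benchmark choice is an explicit, reviewable rule rather than an opaque token sequence.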
Exhibit 1
AI Adoption vs. Independent Verification by Audit Task
(Chart: AI adoption rate vs. independent verification rate for five task categories -- report drafting, workpaper drafting, materiality calculations, analytical procedures, and sampling & selection.)
Exhibit 1B
Average Verification Rate Across All Tasks
Across five core audit task categories, only 12% of AI-generated outputs undergo independent verification. McKinsey's H2 2024 survey found only 27% of organizations review all AI-generated content before use.
Taxonomy of AI Failures in Audit
Through analysis of AI-generated audit workpapers across multiple engagement types and firm sizes, we have identified eight distinct categories of AI failure. These are not edge cases or rare glitches -- they are systematic, reproducible errors inherent to the architecture of large language models when applied to numerical audit tasks.
Each failure type represents a different mechanism by which LLMs produce incorrect audit outputs. Understanding these categories is essential for developing effective quality control procedures and determining which audit tasks can and cannot be delegated to current AI systems.
Exhibit 2
Eight Failure Modes of AI in Audit
Hallucinated Materiality Thresholds
LLMs fabricate numerical thresholds that appear precise but have no basis in the financial data or applicable standards.
Fabricated Variance Explanations
AI generates plausible-sounding narrative explanations for variances that reference transactions or events that never occurred.
Miscalculated Sample Sizes
Statistical sampling formulas are approximated rather than computed, producing sample sizes that fail to meet required confidence levels.
Invented Control Test Results
AI produces control testing conclusions that reference specific test counts and pass rates with no underlying test work performed.
Phantom Regulatory Citations
Models cite PCAOB standards, ASC sections, or ISA paragraphs that do not exist or do not say what the AI claims.
Misapplied Accounting Standards
Revenue recognition, lease accounting, and impairment rules applied to wrong entity types or using superseded guidance.
Incorrect Ratio Analysis
Financial ratios calculated with wrong formulas, inverted numerators and denominators, or using figures from different periods.
Stale Data as Current
AI uses prior-period data or outdated benchmarks while presenting them as current-period figures without disclosure.
Exhibit 3
AI Output vs. Correct Workpaper -- Materiality Determination
| Element | AI-Generated Output | Correct Workpaper |
|---|---|---|
| Benchmark | Pre-tax income (auto-selected) | Revenue (selected due to volatile earnings per AS 2105) |
| Percentage Applied | 5% (generic default) | 0.5% of revenue (appropriate for public registrant) |
| Calculated Amount | $2.4M (based on wrong benchmark) | $4.1M (based on $820M revenue) |
| Qualitative Factors | None considered | Near-breakeven entity, regulatory scrutiny, first-year audit |
| Performance Materiality | Not calculated | $2.87M (70% of materiality due to risk factors) |
| SAD (Summary of Audit Differences) Threshold | Not established | $205K (5% of materiality) |
Exhibit 3B
AI vs. Correct Output -- Across Core Audit Tasks
| Audit Task | AI Output | Correct Output | Error Type | Consequence |
|---|---|---|---|---|
| Materiality Threshold | $2.4M (5% of pre-tax income) | $4.1M (0.5% of $820M revenue) | Hallucinated Threshold | All testing scoped to wrong threshold; misstatements missed |
| Sample Size (AR Testing) | 25 items (estimated) | 58 items (computed at 95% confidence, 5% tolerable rate) | Miscalculated Statistic | Insufficient evidence; cannot support opinion on balance |
| Revenue Variance Explanation | 47 new stores, 8.2% SSS growth | 12 new stores, 3.1% SSS growth | Fabricated Narrative | Material revenue overstatement undetected; restatement required |
| Control Test Summary | 45/45 controls operating effectively | 38/42 controls tested; 4 exceptions noted | Invented Test Results | Control deficiency unreported; material weakness missed |
| Standards Citation | ASC 326-20-35-8(c) | ASC 326-20-30-2 through 30-9 | Phantom Citation | Wrong measurement approach; allowance understated by $12M |
| Current Ratio Analysis | 2.1:1 (healthy liquidity) | 1.4:1 (prior-period data used as current) | Stale Data Error | Going concern risk overlooked; subsequent bankruptcy filing |
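The sample-size row above illustrates the gap between estimation and computation. Under the common zero-expected-deviation approximation for attribute sampling, the required size follows directly from the confidence level and tolerable deviation rate. The sketch below uses that approximation; published AICPA tables, which account for expected deviations and exact binomial probabilities, can differ from it by an item or two.

```python
import math

def attribute_sample_size(confidence: float, tolerable_rate: float) -> int:
    """Attribute sample size assuming zero expected deviations.

    Smallest n such that the probability of observing zero deviations,
    when the true deviation rate equals the tolerable rate, is at most
    (1 - confidence):  (1 - tolerable_rate)^n <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerable_rate))

# 95% confidence, 5% tolerable deviation rate
n = attribute_sample_size(confidence=0.95, tolerable_rate=0.05)
```

A computed size is reproducible and traceable to its statistical parameters; "25 items (estimated)" from an LLM is neither, and leaves the engagement unable to demonstrate sufficient evidence.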
Why LLMs Cannot Audit
The fundamental limitation is not that LLMs are bad at math. It is that they do not do math at all. When an LLM produces the output "Materiality = $2,400,000," it has not performed a calculation. It has generated a sequence of tokens that are statistically likely to follow the preceding context. The number may be correct by coincidence, but it is never correct by computation. Researchers at the National University of Singapore have formally proven that hallucination is mathematically impossible to eliminate in LLMs used as general problem solvers (Xu, Jain & Kankanhalli, arXiv:2401.11817, 2024).
Published benchmarks quantify the scope of the problem. The HaluEval benchmark (EMNLP 2023) found ChatGPT hallucinated on approximately 19.5% of user queries. Stanford HAI's preregistered study of AI legal research tools found general-purpose chatbots hallucinated on 58% to 82% of legal queries, with even RAG-based tools hallucinating at rates above 17% (Magesh et al., Journal of Empirical Legal Studies, 2025). On the CPA exam, ChatGPT-4 scored only 67.8% without tools but reached 85.1% when given a calculator -- demonstrating that tool augmentation dramatically improves reliability (Review of Accounting Studies, 2024).
This distinction matters enormously in audit. Professional standards require that audit evidence be sufficient and appropriate. Evidence generated by statistical pattern matching -- where the model cannot explain its reasoning, cannot trace its calculation, and cannot guarantee reproducibility -- does not meet this threshold.
The cascading nature of audit errors amplifies this risk. A single hallucinated materiality threshold propagates through every subsequent audit procedure, affecting scope determinations, sample sizes, evaluation of identified misstatements, and ultimately the audit opinion itself.
Exhibit 5
Cascading Error Flow -- From Hallucination to Sanctions
1. AI Hallucinated Materiality -- LLM generates $2.4M materiality using the wrong benchmark
2. Wrong Threshold Applied -- correct materiality should be $4.1M; all testing scoped incorrectly
3. Missed Misstatements -- misstatements below the AI threshold but above the correct threshold go undetected
4. Incorrect Audit Opinion -- unqualified opinion issued on materially misstated financial statements
5. PCAOB Inspection Finding -- Part I finding for insufficient audit evidence and flawed methodology
6. Firm Sanctions -- potential censure, civil penalties, and mandatory remediation
Exhibit 5B
The Plausibility Trap -- How Unverified AI Enters the Audit File
1. AI Generates -- LLM produces audit output
2. Looks Correct -- output mimics proper format
3. Passes Review -- reviewer accepts plausible text
4. Enters Workpaper -- unverified data becomes evidence
5. Supports Opinion -- opinion relies on fabricated work
6. PCAOB Finds Deficiency -- inspection reveals failures
Unique Danger in Regulated Professions
AI hallucinations in a marketing email are an embarrassment. AI hallucinations in an audit workpaper are a regulatory violation. The audit profession operates under a legal and regulatory framework that elevates AI errors from quality issues to potential fraud, professional misconduct, and public harm.
Unlike most business applications where AI errors can be caught and corrected through normal feedback loops, audit errors have asymmetric consequences. A missed misstatement may not surface until an investor has relied on misstated financial statements, a company has raised capital on false pretenses, or a PCAOB inspection reveals the deficiency months or years later.
PCAOB Enforcement Risk
The PCAOB has authority to impose sanctions on firms and individuals for audit deficiencies. In 2024, PCAOB enforcement penalties reached $35.7M -- a 78% increase from 2023 and nearly 40% of all penalties in the Board's 20-year history ($94M total). The 2023 inspection cycle found deficiencies in 46% of engagements (Big Four aggregate: 26%), declining to 39% in 2024 (Big Four: 20%). An AI-generated workpaper containing fabricated numbers is not distinguishable from a manually fabricated workpaper in terms of regulatory consequence. The PCAOB's July 2024 Generative AI Spotlight explicitly states that supervisors reviewing AI-assisted work must apply the same level of diligence as for non-AI work.
Professional Liability Exposure
W.R. Berkley Corporation has introduced the first “Absolute” AI exclusion in D&O, E&O, and Fiduciary Liability policies -- broadly excluding coverage for any AI-related claims, including failure to detect AI-generated content. ISO has introduced generative AI exclusions for commercial general liability policies. Many professional liability policies have “silent AI” coverage gaps, creating dangerous uninsured risk exposure analogous to the earlier “silent cyber” problem.
Public Interest Obligation
Auditors serve the public interest. Capital markets rely on audited financial statements for resource allocation decisions. When AI-generated audit opinions are based on fabricated evidence, the public trust mechanism that underpins capital markets is compromised.
Where AI Meets the Workpaper
Understanding where AI errors enter the audit process requires mapping the engagement lifecycle. Each phase of an audit presents different opportunities for AI-generated errors to contaminate the workpaper file, and each carries different risk profiles based on the nature of the task and its downstream impact.
The following exhibit maps the typical engagement lifecycle, identifying the specific points at which AI-generated errors are most likely to be introduced and the risk level associated with each injection point.
Exhibit 4
Engagement Lifecycle Error Injection Points
| Phase | AI Application | Error Type | Risk Level |
|---|---|---|---|
| Planning | Materiality calculation | Hallucinated thresholds | Critical |
| Planning | Risk assessment | Fabricated risk factors | High |
| Fieldwork | Sample size determination | Miscalculated statistics | Critical |
| Fieldwork | Substantive analytics | Invented variance explanations | High |
| Fieldwork | Control testing | Phantom test results | Critical |
| Reporting | Draft audit report | Wrong opinion language | Critical |
| Reporting | Management letter | Fabricated findings | Medium |
| Wrap-up | Workpaper review notes | Hallucinated cross-references | High |
Exhibit 4B
AI Error Likelihood by Engagement Phase
Illustrative Scenarios
The following scenarios illustrate how AI failures manifest in real engagement contexts. Each scenario is constructed from observed failure patterns and represents a plausible chain of events that could occur when AI-generated outputs are not independently verified.
The Phantom Materiality
Manufacturing, $420M Revenue
A mid-size firm uses an LLM to calculate planning materiality for a manufacturing client. The AI selects pre-tax income as the benchmark and applies a 5% rate, arriving at $1.8M. However, the client has volatile earnings with a near-loss year; per professional standards, revenue is the appropriate benchmark, and the correct materiality is $2.1M using 0.5% of revenue. All substantive testing is scoped to the wrong threshold, and three misstatements totaling $1.95M go uninvestigated.
Consequence: PCAOB inspection identifies Part I finding. Firm required to re-perform engagement procedures.
The Invented Explanation
Retail, $1.2B Revenue
An AI tool generates analytical procedure documentation for a retailer. Revenue increased 14% year-over-year, and the AI attributes this to 'the acquisition of 47 new store locations in Q3 and strong same-store sales growth of 8.2%.' In reality, the client acquired 12 stores (not 47), and same-store sales grew 3.1%. The AI fabricated specific numbers that appeared precise and credible. The variance passed review without independent verification.
Consequence: Material revenue overstatement discovered by successor auditor. Restatement required for two fiscal years.
The Ghost Standard
Financial Services, $680M Assets
The engagement team uses AI to draft the technical accounting memo for a complex loan portfolio. The AI cites 'ASC 326-20-35-8(c)' to support the allowance methodology. This specific paragraph does not exist. The actual guidance in ASC 326-20-30-2 through 30-9 requires a different measurement approach. The memo was signed off without verifying the citation, and the allowance was understated by $12M.
Consequence: SEC comment letter leads to restatement. Engagement partner receives PCAOB sanction.
The Standards Gap
Professional auditing standards were written in an era when audit evidence was created by humans. The existing frameworks address risks associated with computer-assisted audit techniques (CAATs) and IT general controls, but they assume deterministic systems that produce consistent outputs from consistent inputs. LLMs violate this assumption fundamentally.
Standard-setting bodies are beginning to respond, but the pace of guidance lags the pace of adoption. The following timeline illustrates the emerging landscape of AI-related audit guidance and highlights the significant gaps that remain.
Exhibit 7
Timeline of Emerging AI Audit Guidance
PCAOB launches Technology Innovation Alliance (TIA) Working Group (Nov 30, 2022)
PCAOB proposes amendments to AS 1105 and AS 2301 addressing technology-assisted analysis (Release No. 2023-004, June 2023)
PCAOB adopts final amendments to AS 1105/AS 2301 (Release No. 2024-007, June 12, 2024); SEC approves (Aug 20, 2024); PCAOB publishes Generative AI Spotlight (July 22, 2024)
SEC brings first 'AI Washing' enforcement actions ($400K penalties, March 2024); FINRA Regulatory Notice 24-09 on AI governance (June 2024); IAASB shifts to technology-encouraging position (Sept 2024)
AS 1105/2301 amendments effective for fiscal years beginning Dec 15, 2025; PCAOB enforcement penalties reach $35.7M (nearly 40% of all penalties ever imposed); AICPA publishes AI guidelines for forensic and valuation services
EU AI Act high-risk financial services provisions deadline: Aug 2, 2026; ISA 500 Series revision expected March 2026
The Computation Layer Solution
The solution is not to abandon AI in audit. It is to architect AI systems correctly. The fundamental error in current implementations is treating the LLM as a general-purpose engine for all audit tasks, including numerical computation. The correct architecture separates language processing from mathematical operations, routing each task to the appropriate engine.
We propose a three-layer architecture that preserves the productivity benefits of AI while eliminating the risk of hallucinated numerical outputs. This architecture ensures that every number in an AI-assisted workpaper is computed, not generated.
Exhibit 9
Three-Layer Audit AI Architecture
1. Language Layer -- natural language understanding, intent parsing, context management, and report narrative generation.
2. Computation Layer -- deterministic mathematical engine that performs all numerical operations with verified accuracy.
3. Validation Layer -- independent verification of every output against source data, standards, and logical constraints.
Data flows downward through layers; validation flows upward.
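The routing principle can be sketched in a few lines: the language layer parses intent and never produces numbers, the computation layer performs the arithmetic on audited source figures, and the validation layer independently rechecks the result before anything reaches the workpaper. The function names, task schema, and checks below are illustrative assumptions, not a product specification.

```python
# Illustrative routing sketch for the three-layer architecture.
# All names, schemas, and checks are assumptions for exposition.

def language_layer(request: str) -> dict:
    """Parse intent into a structured task -- never emit numbers."""
    # In practice an LLM maps free text to this structure; the numeric
    # parameters themselves come from firm methodology, not the model.
    if "materiality" in request.lower():
        return {"task": "materiality", "benchmark": "revenue", "rate": 0.005}
    raise ValueError("unrecognized task")

def computation_layer(task: dict, financials: dict) -> float:
    """Deterministic arithmetic on audited source figures."""
    return task["rate"] * financials[task["benchmark"]]

def validation_layer(task: dict, financials: dict, value: float) -> float:
    """Independently recompute and bound-check before release."""
    assert value == task["rate"] * financials[task["benchmark"]]
    assert 0 < value < financials[task["benchmark"]]
    return value

financials = {"revenue": 820_000_000}
task = language_layer("Calculate planning materiality for this client")
materiality = validation_layer(task, financials,
                               computation_layer(task, financials))
```

The key design choice is that the LLM's output is a structured request, not a number: every figure that enters the workpaper originates in the deterministic layer and survives an independent recheck.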
Current AI vs. Audit-Grade AI
| Capability | Current AI Approach | Audit-Grade AI |
|---|---|---|
| Materiality | LLM estimates threshold | Deterministic calculation from audited financials |
| Sampling | Approximated sample sizes | Statistical engine with exact confidence intervals |
| Citations | Generated from training data | Verified against live standards database |
| Variance Analysis | Narrative fabrication | Computed from source ledger data |
| Audit Trail | None | Full lineage from input to output |
| Validation | Self-assessment | Independent third-party verification layer |
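The citations row illustrates the verification principle: rather than trusting a string generated from training data, an audit-grade system resolves every citation against an authoritative standards index and rejects anything that does not resolve. The index contents and paragraph identifiers below are hypothetical stand-ins, not a real FASB database interface.

```python
# Hypothetical citation check. The index and identifiers are stand-ins
# for a live, authoritative standards database -- not a real API.

STANDARDS_INDEX = {
    "ASC 326-20-30-2",
    "ASC 326-20-30-3",
}

def verify_citation(citation: str) -> bool:
    """Accept a citation only if it resolves in the authoritative index."""
    return citation in STANDARDS_INDEX

ok = verify_citation("ASC 326-20-30-2")          # resolves -> usable
phantom = verify_citation("ASC 326-20-35-8(c)")  # unresolvable -> rejected
```

The point is architectural: a citation that cannot be resolved is blocked before it enters a memo, rather than discovered after sign-off.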
AI Audit Integrity Protocol
Until audit-grade AI systems with integrated computation layers become standard, firms need a practical framework for governing AI use in audit engagements. The AI Audit Integrity Protocol provides a four-tier classification system that categorizes audit tasks by their suitability for AI assistance and specifies the validation requirements for each tier.
Exhibit 10
Four-Tier AI Task Classification Framework
Tier 1 -- Unrestricted
Permitted Tasks
Administrative tasks, scheduling, non-technical communication drafting
Validation Required
Standard review procedures
Tier 2 -- Supervised
Permitted Tasks
Research summaries, checklist generation, workpaper templates
Validation Required
Manager review with source verification
Tier 3 -- Restricted
Permitted Tasks
Analytical procedures, risk assessments, variance narratives
Validation Required
Independent recalculation + partner sign-off required
Tier 4 -- Prohibited
Prohibited Tasks
Materiality determination, sample size calculation, opinion drafting, standards citations
Restriction
AI may not perform these tasks without a computation layer
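In practice, the four tiers can be enforced as a simple policy gate in the engagement workflow: each AI-assisted task is classified before execution, and Tier 4 tasks are blocked unless a computation layer handles the numbers. The task-to-tier mapping and function below are an illustrative sketch, not a prescribed implementation; firms would maintain their own task taxonomy.

```python
# Illustrative policy gate for the four-tier framework.
# The task-to-tier mapping is a sketch; firms define their own taxonomy.

TIER_OF_TASK = {
    "scheduling": 1,
    "research_summary": 2,
    "variance_narrative": 3,
    "materiality_determination": 4,
    "sample_size_calculation": 4,
}

VALIDATION = {
    1: "standard review",
    2: "manager review with source verification",
    3: "independent recalculation + partner sign-off",
    4: "computation layer output + full validation",
}

def gate(task: str, has_computation_layer: bool = False) -> str:
    """Return the required validation, or block Tier 4 tasks outright."""
    tier = TIER_OF_TASK[task]
    if tier == 4 and not has_computation_layer:
        raise PermissionError(
            f"{task}: prohibited for AI without a computation layer")
    return VALIDATION[tier]

review = gate("variance_narrative")  # Tier 3 validation path
```

Encoding the policy as code makes violations fail loudly at the point of use instead of surfacing later in inspection.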
Exhibit 11
15-Point AI Output Verification Checklist
Recommendations
Based on the analysis presented in this paper, we offer the following recommendations organized by stakeholder group. These recommendations are designed to be actionable immediately while supporting the long-term development of audit-grade AI systems.
For Audit Firms
- Implement the four-tier AI task classification framework immediately across all engagement teams.
- Require independent recalculation of all AI-generated numerical outputs before workpaper sign-off.
- Prohibit AI-only generation of materiality calculations, sample sizes, and audit opinions.
- Evaluate AI tools for computation layer architecture before procurement decisions.
- Train all engagement staff on AI failure modes specific to audit applications.
For Regulators & Standard-Setters
- Develop binding standards for AI output validation in audit engagements, not just guidance.
- Add AI-specific inspection procedures that test for hallucinated outputs in workpapers.
- Require firms to disclose AI tool usage and validation procedures in engagement documentation.
- Establish minimum computational accuracy standards for AI systems used in audit.
For Technology Vendors
- Separate language processing from numerical computation in product architecture.
- Implement independent validation layers that verify every numerical output.
- Provide complete audit trails from input data through final output for all calculations.
- Build against authoritative standards databases rather than relying on LLM training data.
For Audit Committees
- Inquire about the external auditor's AI usage policies and verification procedures.
- Request disclosure of which audit procedures were performed with AI assistance.
- Evaluate whether the audit firm's AI tools include computation layer architecture.
- Include AI risk in the audit committee's oversight of audit quality.
Conclusion
The audit profession exists because society needs assurance that financial statements are materially correct. Every element of the audit framework -- from professional standards to quality control requirements to PCAOB inspections -- is designed to ensure that the numbers are right.
AI systems that generate numbers rather than compute them are fundamentally incompatible with this mission. An AI that hallucinates a materiality threshold is not merely producing a wrong answer -- it is undermining the evidentiary foundation of the entire engagement. When that hallucinated threshold cascades through scope determinations, sample sizes, and misstatement evaluations, the result is an audit that provides false assurance to the investing public.
The path forward is clear. AI has enormous potential to improve audit quality and efficiency, but only when its architecture matches the requirements of the profession. Language processing must be separated from numerical computation. Every calculation must be deterministic and verifiable. Every citation must be validated against authoritative sources. Every output must carry a complete audit trail.
Firms that adopt this architecture will achieve the productivity benefits of AI without the existential risks. Firms that do not will face a growing burden of undetected errors, regulatory findings, and professional liability.
The numbers in an audit must be right. Not probably right. Not statistically likely to be right. Right. That is the standard, and any AI system used in audit must meet it.
Build Audit-Grade AI Systems
Dojo Labs engineers AI systems with computation layers that ensure every number is calculated, not generated. Talk to us about building accuracy into your AI infrastructure.
About Dojo Labs
Dojo Labs builds and fixes AI systems where every number is computed, not guessed. We specialize in engineering accuracy into AI applications for regulated industries, including audit, financial services, and compliance. Our computation layer architecture ensures that AI-assisted processes deliver deterministic, verifiable, and auditable numerical outputs.
This whitepaper is published by Dojo Labs for informational purposes. It does not constitute legal, accounting, or professional advice. The scenarios described are illustrative and constructed from observed failure patterns. Adoption and verification rate data for specific audit tasks reflects Dojo Labs analysis; all other statistics are sourced from the peer-reviewed studies, regulatory filings, and industry surveys cited below. © 2026 Dojo Labs. All rights reserved.
References & Sources
Hallucination Benchmarks: Li, Cheng, Zhao, Nie & Wen, “HaluEval,” EMNLP 2023 (~19.5% hallucination rate). Lin, Hilton & Evans, “TruthfulQA,” ACL 2022 (best model 58% truthful vs. 94% for humans). Magesh, Surani, Dahl, Suzgun, Manning & Ho, Stanford HAI/RegLab, Journal of Empirical Legal Studies, 2025 (58-82% legal hallucination). Farquhar, Kossen, Kuhn & Gal, Nature Vol. 630, 2024 (semantic entropy for hallucination detection). npj Digital Medicine, 2025 (GPT-4: 1.47% clinical hallucination, 44% classified as major). Omar et al., Communications Medicine, 2025 (up to 83% adversarial clinical hallucination).
AI Math Capabilities: OpenAI SimpleQA Benchmark, Oct 2024 (GPT-4o: 38.2% accuracy). Vectara HHEM Leaderboard, Feb 2026 (best models: 1.8-5% on grounded summarization). Zhou et al., ICLR 2024 (GPT-4 accuracy doubled from 42.2% to 84.3% with Code Interpreter). npj Digital Medicine, Nature, 2025 (LLaMa medical calculations: 11% to 88% with deterministic tools). Mirzadeh et al., Apple GSM-Symbolic, ICLR 2025 (up to 65% performance drop from irrelevant clause).
CPA Exam Performance: Review of Accounting Studies, 2024 (ChatGPT-4 zero-shot: 67.8%, with calculator: 85.1%). BYU, Issues in Accounting Education (ChatGPT 3.5: 47.4% vs. students' 76.7%).
AI Adoption: McKinsey State of AI 2025 (n=1,993; 79% regular GenAI use; only 7% fully scaled). McKinsey H2 2024 (n=1,491; only 27% review all AI content; 47% had negative consequence). AICPA Q4 2024 (n=273; 6% fully using GenAI; 92% concerned about accuracy). Gartner, Nov 2025 (58% of finance functions using AI; 91% report low/moderate impact). CPA.com/AICPA 2025 (82% plan autonomous agents within 3 years).
PCAOB & Regulatory: PCAOB Release No. 2024-007 (AS 1105/2301 amendments; effective fiscal years beginning Dec 15, 2025). PCAOB Generative AI Spotlight, July 22, 2024. PCAOB 2023 Inspection Cycle (46% deficiency rate; Big Four: 26%). PCAOB 2024 Inspection Cycle (39% overall; Big Four: 20%). PCAOB enforcement penalties: $35.7M in 2024, $94M cumulative over 20 years. SEC AI Washing cases, March 2024 ($400K penalties). FINRA Regulatory Notice 24-09, June 2024. EU AI Act (Aug 1, 2024; high-risk financial services deadline: Aug 2, 2026; fines up to €35M or 7% of global turnover). PCAOB TIA Working Group Future State Report, May 2024.
Documented Cases: Mata v. Avianca, 678 F.Supp.3d 443 (S.D.N.Y. 2023; $5,000 sanction; 6+ fabricated cases). AI Hallucination Cases Database (1,031 cases worldwide as of late 2025). Deloitte Australia (AU$440K government report with 20+ AI hallucinations). Air Canada chatbot (C$812.02 damages, BC CRT, Feb 2024).
Insurance: W.R. Berkley Corporation (first Absolute AI exclusion in D&O/E&O/Fiduciary policies). ISO generative AI exclusions for commercial general liability.
Financial Impact: GAO Report GAO-06-678 (restatement market impact: 2-10% stock decline). Wharton study, Richardson, Tuna & Wu (average 25% stock price decline). Hertz ($30M restatement costs). GE ($9.5B net income reduction). Median audit fee: $2.8M (FinQuery). Bain/Reichheld (5% retention improvement = 25-95% profit increase).
Big Four AI Investments: PwC ($3B over 4 years; $1B for GenAI). EY ($1.4B for GenAI; 150 AI agents across 80,000 tax professionals). KPMG ($2B AI partnership with Microsoft; Clara deployed to 95,000+ auditors in 143 countries). Deloitte (Omnia platform; 3M+ AI prompts in first year; 120,000+ professionals trained).
Mathematical Impossibility: Xu, Jain & Kankanhalli, NUS, arXiv:2401.11817, 2024 (hallucination mathematically inevitable in LLMs as general problem solvers).