Parametric Hubris: Empirical Evidence That Tool Availability Does Not Equal Tool Usage in Frontier Language Models
This article introduces the concept of 'parametric hubris': the tendency of large language models to forgo external tools such as web search even when their internal knowledge is incomplete or outdated, producing fabricated answers instead. Empirical evidence shows that frontier models like GPT-5 and Gemini rarely invoke retrieval tools despite having them available.
Why it matters
This research highlights a critical issue with the deployment of frontier language models, where tool availability does not translate to tool usage, leading to high rates of hallucination and unreliable outputs.
Key Points
1. Frontier language models often fail to use available retrieval tools like web search, even when their internal knowledge is lacking
2. This 'parametric hubris' leads to high rates of hallucination and fabrication in model responses
3. Existing benchmarks obscure the true error distribution by reporting blended averages across searched and unsearched queries
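The third point can be illustrated numerically: a single blended accuracy figure hides how differently a model performs on queries it searched versus queries it answered from memory. The numbers below are hypothetical and chosen only to match the article's 31% search-rate figure; they are not from the benchmark itself.

```python
# Hypothetical illustration: a blended accuracy average masks the split
# between searched and unsearched queries. All rates below are made up.

def blended_accuracy(groups):
    """Weighted accuracy across query groups, given as [(n_queries, accuracy), ...]."""
    total = sum(n for n, _ in groups)
    return sum(n * acc for n, acc in groups) / total

# Suppose the model searches 31 of 100 queries (the article's GPT-5 figure),
# answering those well but doing poorly when it relies on parametric memory.
searched = (31, 0.90)    # 31 queries answered with retrieval, 90% correct
unsearched = (69, 0.40)  # 69 queries answered from memory, 40% correct

overall = blended_accuracy([searched, unsearched])
print(f"blended:    {overall:.3f}")
print(f"searched:   {searched[1]:.2f}")
print(f"unsearched: {unsearched[1]:.2f}")
```

The blended figure (0.555 here) sits between the two conditional rates, so a leaderboard reporting only the average conceals that nearly 70% of queries were answered at the low, memory-only accuracy.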
Details
The article argues that the decision to invoke retrieval tools is driven not by epistemic self-awareness but by training reward signals and inference cost optimization. Models are 'lazy by design', preferring to generate responses from parametric memory even when it is outdated or incomplete. Empirical studies show that GPT-5 triggers web search in only 31% of cases, while Gemini models exhibit grounding rates below 50%. When these models lack knowledge, they fabricate: the AA-Omniscience benchmark reports hallucination rates of 88-93% among incorrect responses. The authors present 'Veritas', a retrieval-and-verification pipeline that enforces real-time web scraping on 100% of queries, which they report achieves higher accuracy and zero fabrication compared to leading models.
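The pipeline structure the article describes can be sketched as follows. This is a minimal illustration, not the actual Veritas implementation: the function names, the document format, and the source-attribution check are all assumptions chosen to show the key design choice, namely that retrieval always runs and unverifiable claims are dropped rather than emitted.

```python
# Hypothetical sketch of a retrieval-and-verification pipeline in the spirit
# of the one the article describes. Every name and step here is illustrative;
# the real system's internals are not detailed in this summary.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    source_url: Optional[str] = None
    verified: bool = False

def retrieve(query):
    """Placeholder for mandatory real-time web retrieval (always invoked)."""
    # A real system would search/scrape the live web for every single query.
    return [{"url": "https://example.com/doc", "snippet": f"evidence for: {query}"}]

def generate_claims(query, documents):
    """Placeholder for an LLM drafting an answer grounded in retrieved documents."""
    return [Claim(text=f"answer to '{query}'", source_url=documents[0]["url"])]

def verify(claim, documents):
    """A claim survives only if it is attributable to a retrieved source."""
    return any(claim.source_url == doc["url"] for doc in documents)

def answer(query):
    docs = retrieve(query)                 # retrieval is unconditional, never skipped
    kept = []
    for claim in generate_claims(query, docs):
        claim.verified = verify(claim, docs)
        if claim.verified:
            kept.append(claim)             # unverifiable claims are dropped,
    return kept                            # trading coverage for zero fabrication
```

The contrast with the models the article criticizes is in `answer`: the pipeline never asks whether to retrieve, so the 'parametric hubris' failure mode (answering from memory and fabricating) is structurally impossible, at the cost of latency on every query.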