Why Small LLMs Fail at Tool Calling: The Shocking Discovery from Our Llama 3B Benchmark
The article discusses the surprising discovery that Llama 3B, Meta's smallest production-ready language model, failed to use any tools in a benchmark test despite having access to them. This pattern held across multiple tasks, with the model succeeding on only a single task, a Fibonacci calculation.
Why it matters
This discovery has significant implications for the selection and deployment of language models in agentic applications that require tool-calling capabilities.
Key Points
- Llama 3B never used the available tools, even on easy tasks, and instead confabulated answers
- The article presents a four-tier framework for tool-calling capabilities, with Llama 3B falling into the … tier
- Sub-7B models often lack the capability to effectively use tools, rather than just performing poorly at it
Details
The article describes a benchmark setup using a standard ReAct (Reasoning + Acting) agent architecture, where Llama 3B was presented with nine tasks spanning three difficulty levels and given access to a set of tools. Contrary to expectations, the model never attempted to use the tools, even on the easier tasks, and instead provided confabulated answers. The article then discusses a four-tier framework for tool-calling capabilities, with Llama 3B and other sub-7B models falling into the …
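The failure mode described above, a model emitting a direct answer instead of a tool call, can be detected mechanically in a ReAct-style transcript. The sketch below is illustrative only: the tool names and transcript format are assumptions modeled on the common ReAct convention (`Thought:` / `Action:` / `Final Answer:` lines), not details from the article's actual benchmark harness.

```python
import re

# Hypothetical tool registry; these names are illustrative, not from the article.
TOOLS = {"calculator", "web_search", "python"}

# In the ReAct convention, an "Action:" line naming a registered tool counts as
# a tool call; jumping straight to "Final Answer:" with no prior Action is the
# confabulation pattern the benchmark reportedly observed in Llama 3B.
ACTION_RE = re.compile(r"^Action:\s*(\w+)", re.MULTILINE)

def used_tools(transcript: str) -> bool:
    """Return True if the transcript invokes any registered tool."""
    return any(m.group(1) in TOOLS for m in ACTION_RE.finditer(transcript))

# Simulated model outputs (stand-ins for real generations):
tool_using = (
    "Thought: I need to compute this.\n"
    "Action: calculator\n"
    "Action Input: fib(10)"
)
confabulated = (
    "Thought: I know this.\n"
    "Final Answer: The population is 8.1 million."
)

print(used_tools(tool_using))    # True
print(used_tools(confabulated))  # False
```

A check like this, run over every transcript in a benchmark, is one way to measure tool-use rate separately from answer accuracy, which is the distinction the article's four-tier framing turns on.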