Hermes 4 Trains Tool-Calling as a Separate Skill

Nous Research's Atropos RL framework trains tool-calling as a separate skill, not just a prompt-format convention. This leads to more reliable and structurally valid tool invocations.

💡 Why it matters

This training methodology for tool-calling can lead to more reliable and production-ready AI agents, with real trade-offs around reasoning mode, token cost, and other benchmarks.

Key Points

  • Atropos trains tool-calling behavior via rejection sampling rather than typical RLHF fine-tuning
  • Hermes 4 uses in-turn XML-style tags for tool definitions and invocations
  • This approach makes the inner JSON reliably schema-valid, not just syntactically plausible
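To make the second point concrete, here is a minimal sketch of the in-turn format: the tool call is an XML-tagged block whose body is JSON. The `<tool_call>` tag name follows the publicly documented Hermes chat template; the weather tool and its parameters are hypothetical.

```python
import json
import re

# Illustrative assistant turn in the Hermes-style format: an XML-tagged
# block containing a JSON tool invocation (tool name/arguments invented).
assistant_turn = """<tool_call>
{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
</tool_call>"""

def extract_tool_calls(text: str) -> list[dict]:
    """Pull each <tool_call> body out of a turn and parse it as JSON."""
    bodies = re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    return [json.loads(body) for body in bodies]

calls = extract_tool_calls(assistant_turn)
print(calls[0]["name"])       # the function the model wants to invoke
print(calls[0]["arguments"])  # its JSON arguments
```

Because the outer envelope is a fixed tag rather than free prose, a harness can deterministically locate and parse the call before dispatching it.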

Details

The Atropos RL framework runs roughly 1,000 task-specific verifiers, including ones for Schema Adherence and Tool Use. This trains the model to emit structurally valid, constraint-respecting JSON for tool calls, not just "JSON-shaped text". The methodology differs from typical RLHF fine-tuning: Atropos generates candidate responses and filters them through the verifiers, each of which emits a binary pass/fail signal. That filtering shapes the model's tool-calling behavior to explicitly satisfy the schema's structural constraints. The article notes the absence of a published benchmark confirming this holds across arbitrary user-defined schemas, but the qualitative observations are consistent with the training approach.
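The generate-then-filter loop described above can be sketched as follows. This is not the Atropos implementation; the schema, verifier logic, and candidate strings are invented to illustrate a binary-signal rejection-sampling filter.

```python
import json

# Toy binary verifier in the spirit of a Schema Adherence check: a candidate
# passes only if it is valid JSON AND satisfies the tool schema's structural
# constraints. The schema below is illustrative, not from Atropos.
SCHEMA = {"required": {"name": str, "arguments": dict}}

def verify(candidate: str) -> bool:
    """Binary reward: True if the candidate is schema-valid JSON, else False."""
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return False  # not even parseable JSON
    if not isinstance(obj, dict):
        return False
    return all(
        key in obj and isinstance(obj[key], typ)
        for key, typ in SCHEMA["required"].items()
    )

def rejection_sample(candidates: list[str]) -> list[str]:
    """Keep only verifier-passing samples; these become training targets."""
    return [c for c in candidates if verify(c)]

candidates = [
    '{"name": "search", "arguments": {"q": "atropos"}}',  # structurally valid
    '{"name": "search", "arguments": "q=atropos"}',       # wrong argument type
    '{"name": "search"',                                  # truncated JSON
]
kept = rejection_sample(candidates)
print(len(kept))  # → 1: only the structurally valid call survives
```

Training on only the surviving samples is what distinguishes this from a pure prompt-format convention: the model is rewarded for satisfying the schema's constraints, not merely for producing JSON-looking output.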

AI Curator - Daily AI News Curation