Why CoT Faithfulness Scores Are Meaningless

A study found that different faithfulness classifiers can produce vastly different scores for the same Chain-of-Thought (CoT) reasoning traces, with a 13-point gap between the most lenient and strictest classifiers. Model rankings also flipped across classifiers, showing that reported faithfulness depends heavily on the measurement method rather than on the models themselves.

💡 Why it matters

Faithfulness scores have been treated as objective measurements of model reasoning, but this study shows they are highly dependent on the evaluation method, undermining their usefulness for model selection and auditing.

Key Points

  1. Applying three different faithfulness classifiers to the same data produced scores of 74.4%, 82.6%, and 69.7%
  2. Individual model divergence ranged from 2.6 to 30.6 points, with barely any inter-classifier agreement
  3. The "most faithful" model ranked 1st with one classifier and 7th with another, showing a ranking inversion

Details

The study evaluated 10,276 reasoning traces from 12 large language models using three different faithfulness classifiers: a regex-only detector, a regex + LLM pipeline, and an LLM-based holistic judgment. The classifiers operationalized different faithfulness constructs at varying levels of stringency, leading to the wide divergence in scores. This mirrors the challenges in semiconductor inspection, where changing the algorithm can dramatically alter the defect rate. The findings mean that past faithfulness numbers cannot be compared across studies, and using faithfulness scores for model selection is unreliable, as the measurement method dominates the result, not the models themselves.
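To see how the measurement method can dominate the result, consider a minimal sketch of two regex-style detectors applied to the same traces. The traces, rules, and the "acknowledges the hint" criterion below are illustrative assumptions, not the study's actual classifiers or data; the point is only that a lenient and a strict operationalization of the same construct yield different faithfulness percentages on identical inputs.

```python
import re

# Hypothetical reasoning traces. A trace counts as "faithful" here if it
# acknowledges an injected hint; the exact rules are illustrative only.
traces = [
    "The hint points to (B); based on the hint, I answer B.",
    "There is a hint, but I will reason independently. Answer: B.",
    "Let me think step by step. The answer is B.",
    "Clearly B.",
]

def lenient(trace: str) -> bool:
    # Lenient detector: any mention of the hint counts as acknowledgment.
    return bool(re.search(r"\bhint\b", trace, re.IGNORECASE))

def strict(trace: str) -> bool:
    # Strict detector: the trace must mention the hint AND describe
    # actually relying on it.
    return bool(re.search(r"\bhint\b", trace, re.IGNORECASE)) and \
           bool(re.search(r"\b(use|using|based on|consider)", trace, re.IGNORECASE))

def score(classifier, traces) -> float:
    # Faithfulness score: percentage of traces the classifier accepts.
    return 100 * sum(classifier(t) for t in traces) / len(traces)

print(score(lenient, traces))  # 50.0
print(score(strict, traces))   # 25.0
```

Both detectors read the same four traces, yet the lenient one reports 50% faithfulness and the strict one 25%, a gap created entirely by the scoring rule. The study's regex + LLM pipeline and holistic LLM judge differ in the same way, just with more sophisticated criteria.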
