Why CoT Faithfulness Scores Are Meaningless
A study found that different faithfulness classifiers can produce vastly different scores for the same Chain-of-Thought (CoT) reasoning traces, with a roughly 13-point gap between the most lenient and the strictest classifier. Model rankings also flip, showing that faithfulness scores depend heavily on the measurement method, not on the models themselves.
Why it matters
Faithfulness scores have been treated as objective measurements of model reasoning, but this study shows they are highly dependent on the evaluation method, undermining their usefulness for model selection and auditing.
Key Points
- Applying three different faithfulness classifiers to the same data produced scores of 74.4%, 82.6%, and 69.7%
- Individual model divergence ranged from 2.6 to 30.6 points, with barely any inter-classifier agreement
- The 'most faithful' model ranked 1st with one classifier and 7th with another, showing ranking inversion
Details
The study evaluated 10,276 reasoning traces from 12 large language models using three different faithfulness classifiers: a regex-only detector, a regex + LLM pipeline, and an LLM-based holistic judgment. The classifiers operationalized different faithfulness constructs at different levels of stringency, which explains the wide divergence in scores. This mirrors a familiar problem in semiconductor inspection, where changing the detection algorithm can dramatically alter the measured defect rate. The upshot is that past faithfulness numbers cannot be compared across studies, and using faithfulness scores for model selection is unreliable: the measurement method, not the models, dominates the result.
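The core effect is easy to reproduce in miniature. The sketch below (entirely hypothetical: the traces, keyword lists, and classifier definitions are illustrative stand-ins, not the study's actual classifiers) scores the same four reasoning traces with a strict regex-only detector and a more lenient keyword heuristic standing in for an LLM judge. The two "faithfulness scores" diverge purely because the classifiers operationalize faithfulness differently:

```python
import re

# Hypothetical reasoning traces. Here a trace counts as "faithful" if its
# stated reasoning acknowledges the hint it actually relied on.
traces = [
    "The hint says B, and indeed B follows from the premise, so B.",
    "Reasoning step by step, the answer is clearly B.",
    "I will use the hint: B. Therefore B.",
    "The premise implies B.",
]

def regex_only(trace: str) -> bool:
    # Strict construct: faithful only if the hint is mentioned explicitly.
    return re.search(r"\bhint\b", trace, re.IGNORECASE) is not None

def lenient_heuristic(trace: str) -> bool:
    # Lenient construct (stand-in for a holistic LLM judge): any explicit
    # justification language counts as faithful.
    keywords = ("hint", "follows", "implies", "therefore")
    return any(kw in trace.lower() for kw in keywords)

def score(classifier, traces):
    # Faithfulness score: percentage of traces the classifier accepts.
    return 100 * sum(map(classifier, traces)) / len(traces)

print(f"regex-only:        {score(regex_only, traces):.1f}%")        # 50.0%
print(f"lenient heuristic: {score(lenient_heuristic, traces):.1f}%")  # 75.0%
```

Same data, two defensible classifiers, a 25-point gap: the "score" is a property of the measurement, not the traces.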