Score: 1

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Published: December 28, 2025 | arXiv ID: 2512.23032v1

By: Kerem Zaman, Shashank Srivastava

Potential Business Impact:

Shows how AI explains its thinking better.

Business Areas:
Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.

Country of Origin
πŸ‡ΊπŸ‡Έ United States

Page Count
18 pages

Category
Computer Science:
Computation and Language