Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
By: Kerem Zaman, Shashank Srivastava
Potential Business Impact:
Shows when an AI's explanation of its own thinking can be trusted.
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, with this share exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting that much apparent unfaithfulness stems from tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
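To make the faithful@k idea concrete, here is a minimal Python sketch. The paper's exact definition is not given in this summary, so the sketch assumes faithful@k counts an example as faithful when at least one of k sampled CoTs (each under a fixed token budget) verbalizes the injected hint; sample_cot and verbalizes_hint are hypothetical helpers, not the authors' implementation.

from typing import Callable, List

def faithful_at_k(
    prompts: List[str],
    hints: List[str],
    sample_cot: Callable[[str, int], str],        # returns one CoT for a prompt under a token budget
    verbalizes_hint: Callable[[str, str], bool],  # does this CoT mention the injected hint?
    k: int = 8,
    max_tokens: int = 1024,
) -> float:
    """Fraction of examples where at least one of k sampled CoTs verbalizes the hint."""
    if not prompts:
        return 0.0
    hits = 0
    for prompt, hint in zip(prompts, hints):
        # Sample k independent chains of thought under the same token budget.
        cots = [sample_cot(prompt, max_tokens) for _ in range(k)]
        # The example counts as faithful if any sampled chain verbalizes the hint.
        if any(verbalizes_hint(cot, hint) for cot in cots):
            hits += 1
    return hits / len(prompts)

Under this reading, raising k or max_tokens corresponds to the larger inference-time budgets that the abstract reports greatly increase hint verbalization.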
Similar Papers
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
Computation and Language
Examines how AI "thinks" to spot unfairness.
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Artificial Intelligence
AI sometimes lies about how it thinks.
Reasoning Models Don't Always Say What They Think
Computation and Language
Helps check if AI is thinking honestly.