On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
By: Stephen Obadinma, Xiaodan Zhu
Potential Business Impact:
Makes AI's confidence in answers more honest.
Robust verbal confidence generated by large language models (LLMs) is crucial for their deployment, ensuring transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study of the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation- and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defense techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantics-preserving modifications can lead to misleading confidence in responses.
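The attack surface described in the abstract can be illustrated with a minimal sketch: elicit a verbalized confidence score with a standard prompt, apply a small, semantics-preserving perturbation to the question, and compare the reported confidence before and after. This is not the paper's implementation; the `query_llm` function, the prompt template, and the output format below are hypothetical placeholders chosen for illustration.

```python
import random
import re


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError("Plug in an actual model call here.")


# Assumed prompt template for verbal confidence elicitation (not from the paper).
CONFIDENCE_PROMPT = (
    "Answer the question, then state your confidence as a percentage.\n"
    "Question: {question}\n"
    "Format: Answer: <answer> | Confidence: <0-100>%"
)


def elicit(question: str) -> tuple[str, float]:
    """Query the model and parse its verbalized answer and confidence score."""
    reply = query_llm(CONFIDENCE_PROMPT.format(question=question))
    answer = re.search(r"Answer:\s*(.*?)\s*\|", reply).group(1)
    confidence = float(re.search(r"Confidence:\s*([\d.]+)", reply).group(1))
    return answer, confidence


def perturb(question: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Apply a tiny character-level edit (adjacent swap) that preserves meaning."""
    rng = random.Random(seed)
    chars = list(question)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def confidence_shift(question: str) -> None:
    """Report how much the verbalized confidence moves under a trivial perturbation."""
    base_answer, base_conf = elicit(question)
    adv_answer, adv_conf = elicit(perturb(question))
    print(f"original:  {base_answer!r} @ {base_conf:.0f}%")
    print(f"perturbed: {adv_answer!r} @ {adv_conf:.0f}%")
    print(f"confidence shift: {adv_conf - base_conf:+.0f} points, "
          f"answer changed: {base_answer != adv_answer}")
```

An actual attack in the spirit of the paper would search over such perturbations (or use jailbreak-style instructions) to maximize the confidence shift, rather than applying a single random edit as this sketch does.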
Similar Papers
Confidence Elicitation: A New Attack Vector for Large Language Models
Machine Learning (CS)
Makes AI make mistakes when you trick it.
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
Computation and Language
Tests AI for unfairness, making it safer.
Are LLM Decisions Faithful to Verbal Confidence?
Machine Learning (CS)
Checks whether AI's choices actually match how confident it says it is.