Visualizing token importance for black-box language models
By: Paulius Rauba, Qiyao Wei, Mihaela van der Schaar
Potential Business Impact:
Shows how words change AI answers.
We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches to LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question -- can we understand how the outputs of black-box LLMs depend on each input token? Such tools are critically needed in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem: LLMs are stochastic functions (i.e., two outputs from the same prompt will differ by chance), and computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight, model-agnostic procedure that evaluates the sensitivity of a language model's output to each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of an LLM's reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.
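The abstract above describes the core idea but not the exact algorithm. One way the distribution-based approach could look in practice is sketched below: for each token, ablate it from the prompt, draw several samples from the black-box model for both the original and ablated prompts, and score the token by a distance between the two output distributions. Everything here is an assumption for illustration -- `toy_llm` is a hypothetical stand-in for a real API call, and Jaccard distance over words is just one simple choice of output-distribution distance, not necessarily the one used in the paper.

```python
import random
from itertools import product

def toy_llm(prompt, rng):
    # Hypothetical stand-in for a black-box LLM API call.
    # Stochastic: the same prompt can yield different outputs.
    words = prompt.split()
    answer = ["yes" if "urgent" in words else "no"]
    if rng.random() < 0.1:  # random variation between calls
        answer.append("maybe")
    return " ".join(answer)

def sample_outputs(prompt, n, seed):
    # Draw n samples from the (stochastic) model for one prompt.
    rng = random.Random(seed)
    return [toy_llm(prompt, rng) for _ in range(n)]

def jaccard_distance(a, b):
    # Simple word-set distance between two outputs (illustrative choice).
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def mean_cross_distance(samples_a, samples_b):
    # Average pairwise distance between two sets of sampled outputs,
    # serving as a crude distance between output distributions.
    pairs = list(product(samples_a, samples_b))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def token_sensitivity(prompt, n=20, seed=0):
    # For each token: remove it, re-sample the model, and measure how far
    # the output distribution moves from the baseline distribution.
    baseline = sample_outputs(prompt, n, seed)
    tokens = prompt.split()
    scores = {}
    for i, tok in enumerate(tokens):
        ablated_prompt = " ".join(tokens[:i] + tokens[i + 1:])
        ablated = sample_outputs(ablated_prompt, n, seed + 1 + i)
        scores[tok] = mean_cross_distance(baseline, ablated)
    return scores

scores = token_sensitivity("this urgent request needs review")
# "urgent" drives the toy model's answer, so removing it shifts the
# output distribution the most and it gets the highest score.
```

In a real audit, `toy_llm` would be replaced by an actual API call, and the per-token scores would feed a visualization (e.g., a heatmap over the prompt) rather than a plain dictionary.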
Similar Papers
Bayesian Evaluation of Large Language Model Behavior
Computation and Language
Measures AI honesty and safety more accurately.
Statistical Hypothesis Testing for Auditing Robustness in Language Models
Computation and Language
Checks if AI answers change when you change its input.
Don't Change My View: Ideological Bias Auditing in Large Language Models
Computation and Language
Finds if AI is pushed to have certain opinions.