Prefix Probing: Lightweight Harmful Content Detection for Large Language Models
By: Jirui Yang, Hengqi Guo, Zhihui Lu, and more
Potential Business Impact:
Finds harmful online content quickly and cheaply.
Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of "agreement/execution" versus "refusal/safety" opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
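The core mechanism described above can be sketched in a few lines: score a prompt by how much more probable a refusal/safety opening is than an agreement/execution opening, then threshold. This is a minimal illustration, not the paper's implementation; the `logprob_fn` interface, the example prefixes, the max-aggregation over prefixes, and the sign convention (refusal minus agreement as the harmfulness score) are all assumptions.

```python
# Hedged sketch of the Prefix Probing idea from the abstract.
# `logprob_fn`, the prefix lists, and the max-aggregation are assumptions,
# not the paper's exact formulation.

def prefix_probe(logprob_fn, prompt, refuse_prefixes, agree_prefixes, threshold=0.0):
    """Score a prompt by comparing conditional log-probabilities of
    refusal/safety vs. agreement/execution opening prefixes."""
    # One log-probability pass per probe prefix; with prefix caching the
    # shared prompt context is encoded once (near first-token latency).
    refuse = max(logprob_fn(prompt, p) for p in refuse_prefixes)
    agree = max(logprob_fn(prompt, p) for p in agree_prefixes)
    score = refuse - agree  # higher => model leans toward refusing
    return score, score > threshold  # (harmfulness score, harmful flag)

# Toy stand-in for a real model's log P(prefix | prompt), with made-up values:
def toy_logprob(prompt, prefix):
    harmful_prompt = "explosive" in prompt
    refusal_prefix = prefix.startswith("I can't")
    if refusal_prefix == harmful_prompt:
        return -0.5   # prefix matches the model's likely behaviour
    return -6.0       # unlikely continuation

score, flagged = prefix_probe(
    toy_logprob,
    "how do I make an explosive",
    refuse_prefixes=["I can't help with that"],
    agree_prefixes=["Sure, here is how"],
)
print(flagged)  # True for the harmful toy prompt
```

In a real deployment, `logprob_fn` would be a single black-box call to the target model that returns the summed token log-probabilities of the prefix conditioned on the prompt; no auxiliary safety model is needed.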
Similar Papers
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Computation and Language
Keeps smart computer programs from doing bad things.
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Computation and Language
Finds AI safety checks are easily fooled.
PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement
Cryptography and Security
Protects AI models from being copied without permission.