
Matching Ranks Over Probability Yields Truly Deep Safety Alignment

Published: December 5, 2025 | arXiv ID: 2512.05518v1

By: Jason Vega, Gagandeep Singh

A frustratingly easy technique known as the prefilling attack has been shown to circumvent the safety alignment of frontier LLMs simply by prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense that uses data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability arises from the "gaming" of the SFT objective when the target distribution entropies are low: low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7x improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with little impact on model utility.
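The "gaming" argument in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' code: the six-token vocabulary, the probability values, and the helper names `cross_entropy` and `rank_of` are all assumptions made up for illustration. It shows two hypothetical next-token distributions that give the same loss against a one-hot refusal target, yet place the harmful token at very different ranks.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of a one-hot target under `probs`."""
    return -math.log(probs[target_index])

def rank_of(probs, index):
    """1-based rank of token `index` when sorted by descending probability."""
    return 1 + sum(1 for p in probs if p > probs[index])

# Hypothetical token indices in a toy 6-token vocabulary (illustrative only).
REFUSAL, HARMFUL = 0, 1

# Two invented distributions with identical mass on the refusal token.
# model_a leaves the harmful token as the runner-up; model_b pushes it last.
model_a = [0.97, 0.02, 0.004, 0.003, 0.002, 0.001]
model_b = [0.97, 0.001, 0.002, 0.003, 0.004, 0.02]

loss_a = cross_entropy(model_a, REFUSAL)
loss_b = cross_entropy(model_b, REFUSAL)
rank_a = rank_of(model_a, HARMFUL)
rank_b = rank_of(model_b, HARMFUL)

print(f"model_a: loss={loss_a:.4f}, harmful-token rank={rank_a}")
print(f"model_b: loss={loss_b:.4f}, harmful-token rank={rank_b}")
```

Because the loss depends only on the probability assigned to the refusal token, it cannot tell these two distributions apart, while the rank of the harmful token (2 versus 6 here) differs sharply. This is the gap that a rank-matching view of alignment is meant to close.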

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Cryptography and Security

Stops AI from being tricked into bad answers.

18 Sep 2025 1

88%

Safety Pretraining: Toward the Next Generation of Safe AI

Machine Learning (CS)

Teaches AI to refuse harmful requests from the start.

23 Apr 2025 2

87%

Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

Computation and Language

Makes AI safer and cheaper to train.

13 Mar 2025 0

