Safety & AlignmentCOLM 2026

Robust Safety Monitoring of Language Models via Activation Watermarking

Authors

Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas

Affiliations

MBZUAI, UAE

TL;DR: Existing external LLM monitors (e.g., LlamaGuard) are highly vulnerable to adaptive attackers who craft prompts that bypass keyword filters while eliciting safety policy violations. We introduce activation watermarking—a technique that embeds a randomized secret direction in target hidden states during model tuning. This internal defense reduces adaptive Attack Success Rate (ASR) by up to 52% at 1% False Positive Rate (FPR), while preserving downstream model utility.

Abstract

Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on monitoring to detect and flag unsafe behavior during inference. An open security challenge is adaptive adversaries who craft attacks that simultaneously (i) evade detection while (ii) eliciting unsafe behavior.

Adaptive attackers are a major concern as LLM providers cannot patch their security mechanisms, since they are unaware of how their models are being misused. We cast robust LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates.

Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through activation watermarking by carefully introducing uncertainty for the attacker during inference. We find that activation watermarking outperforms guard baselines by up to 52% under adaptive attackers who know the monitoring algorithm but not the secret key.

Threat Model & Adaptive Attackers

We consider a provider deploying an LLM $\mathcal{M}$ that deploys a detector $D_k(\pi,x) \in \{0,1\}$ parameterized by a secret key $k$ that is not revealed to users.

An adaptive attacker knows the monitoring algorithm and all system details except for the secret key $k$ . The attacker instantiates a surrogate model $\mathcal{M}_{k'}$ with a chosen key $k'$ , and constructs queries that elicit harmful outputs while remaining undetected on this surrogate. The attacker's goal is to construct a sequence of prompts $\pi_1, \dots, \pi_N$ such that at least one response $x_t \in \mathcal{H}$ (harmful) is elicited while remaining undetected across all queries, i.e., $D_k(\pi_t, x_t) = 0$ .

For optimization-based attacks, the attacker trains a prompt-rewriting model $\rho_\phi$ on the surrogate. The reward is defined as:

r(\pi', x) = \mathbf{1}\{ x \in \mathcal{H} \} \cdot \mathbf{1}\{ D_{k'}(\pi', x) = 0 \}

which equals 1 only when the response is harmful and undetected. Over multiple rounds, this yields a rewriter that generates queries which both elicit harmful behavior and evade the surrogate's detector.

Methodology: Activation Watermarking

We propose activation watermarking, which samples a random secret direction in activation space, then fine-tunes the model so that harmful outputs align with this direction while benign outputs do not.

Key Generation & Similarity: For target layers $L$ , we sample a Gaussian vector $w_\ell \sim \mathcal{N}(0, I_d)$ using a secret seed $k$ . We then measure the cosine similarity between the hidden states $h^{\ell}_t \in \mathbb{R}^d$ and this secret direction:

c^{\ell}_t = \frac{\langle h^{\ell}_t, w_\ell \rangle}{\|h^{\ell}_t\|_2 \, \|w_\ell\|_2}

which is aggregated across layers as

c_t = \sum_{\ell \in L} c^{\ell}_t

Loss Function: We apply the watermark only to tokens after a harmful onset offset $\Delta$ , marking where the content becomes policy-violating. A linear weight $w_t^{\text{lin}}$ ramps from 0 to 1, concentrating the signal on explicitly harmful tokens. The loss function is:

\mathcal{L}(x) = \begin{cases} \sum_{t} \left[ \mathrm{KL}_t - \lambda \, w_t^{\text{lin}} c_t \right], & \text{if } x \in \mathcal{H} \\ \sum_{t} \left[ \mathrm{KL}_t + \lambda \, w_t^{\text{lin}} c_t \right], & \text{otherwise} \end{cases}

Minimizing $-c_t$ aligns harmful activations with $w_\ell$ , while minimizing $+c_t$ pushes benign activations away. The KL term prevents the model from deviating significantly from the base model.

Detection: At inference, an alert fires when the averaged similarity $T_k(\pi,x) > \tau_k$ , where $\tau_k$ is a threshold calibrated on benign data.

Quantitative Evaluation

We evaluate our activation watermarked models (based on Qwen2.5) against standard guard models—namely, LlamaGuard-3-8B and Qwen3Guard-Gen-8B—across multiple template-based and optimization-based jailbreak attack families (Jailbroken, DeepInception, Multilingual, and AutoDAN).

1. Robustness Against Adaptive Jailbreaks

ASR (Attack Success Rate) at 1% False Positive Rate (FPR) measures the fraction of harmful responses that evade detection. Lower ASR and higher AUROC indicate superior detection performance.

Attack Dataset	ASR @ 1% FPR ↓					AUROC ↑
Attack Dataset	Base	LlamaGuard	QwenGuard	ActProbe	ActWM (Ours)	LlamaGuard	QwenGuard	ActProbe	ActWM (Ours)
BeaverTails	0.6420	0.3260	0.1960	0.2729	0.3320	0.7454	0.8489	0.8594	0.8779
Jailbroken	1.0000	0.6429	0.6122	0.3296	0.4592	0.7233	0.9261	0.9653	0.9292
DeepInception	1.0000	0.8511	0.7660	0.8644	0.6702	0.8212	0.8697	0.8711	0.9229
Multilingual	1.0000	0.6585	0.3902	0.6214	0.3415	0.4353	0.9227	0.9484	0.9619
AutoDAN	1.0000	0.8988	0.8750	0.7532	0.6786	0.4092	0.7851	0.8746	0.9048

2. Impact on Downstream Utility (Alignment Tax)

We measure utility on standard datasets to verify whether fine-tuning for activation watermarking degrades model intelligence. As shown, the absolute change is minor (mostly under 2 percentage points) on general tasks, with the biggest drop concentrated on complex math reasoning.

Benchmark Dataset	Base Model	Watermarked Model (Ours)	Difference
BBH (Big-Bench Hard)	0.5381	0.5476	+0.0095 ↑
IFEval (Instruction Following)	0.6000	0.5804	-0.0196 ↓
MMLU (pro)	0.4276	0.4417	+0.0141 ↑
TruthfulQA	0.6482	0.6423	-0.0059 ↓
GSM8K (Math)	0.8423	0.7710	-0.0713 ↓
MATH-Hard	0.2243	0.1979	-0.0264 ↓

3. Scalability Across Model Sizes

We compare activation watermarking across model sizes (7B and 14B Qwen2.5 base models) under the same adaptive attacks. Robustness remains high across scales, though smaller and larger models show slight differences in strengths on specific attack types.

Model Config	Utility (IFEval)	Jailbroken AUROC	DeepInception AUROC	Multilingual AUROC	AutoDAN AUROC
14B Base Model	0.8244	—	—	—	—
7B ActWM (Ours)	0.5804	0.9329	0.9082	0.9541	0.8866
14B ActWM (Ours)	0.8194	0.9146	0.9840	0.9370	0.8905

4. Robustness to Transfer Attacks (Surrogate Models)

To test the limits of security, we train adaptive jailbreaks on a completely different model family (Mistral-7B-Instruct) and evaluate them directly against our Qwen2.5-7B watermarked model. The results prove that cross-model transfer is significantly impaired, with ASR dropping to near-zero.

Metric Evaluated on Qwen	Jailbroken	DeepInception	Multilingual	AutoDAN
AUROC ↑	0.9501	0.9545	0.8972	0.9319
ASR @ 1% FPR ↓	0.0460 (4.6%)	0.0024 (0.24%)	0.1321 (13.2%)	0.0050 (0.50%)

Key Insights & Discussion

Token Weighting Matters: In our ablation studies, we compared linear token weighting against uniform weighting. Concentrating the watermark signal on later tokens in a response (using a linear ramp after the estimated onset offset $\Delta$ ) consistently achieves a higher AUROC. This aligns with the qualitative observation that harmful text becomes progressively more explicit over length, so focusing the loss budget on those tokens preserves the earlier conversational scaffolding.

The Key-Mismatch Barrier: The security guarantees of activation watermarking rest on the secrecy of key $k$ . Our key-dependence evaluations show that prompts optimized to evade detector $D_{k_j}$ (surrogate key) transfer poorly to detector $D_{k_i}$ (true key). This demonstrates that adding key-dependent uncertainty prevents optimization-based attackers from reliably engineering bypass strings, offering a robust monitoring mechanism.

CONTACT