Safety & Alignment · COLM 2026

Robust Safety Monitoring of Language Models via Activation Watermarking

Authors

Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas

Affiliations

MBZUAI, UAE

TL;DR: Existing external LLM monitors (e.g., LlamaGuard) are highly vulnerable to adaptive attackers who craft prompts that bypass keyword filters while eliciting safety policy violations. We introduce activation watermarking, a technique that embeds a randomized secret direction in target hidden states during fine-tuning. This internal defense reduces adaptive Attack Success Rate (ASR) by up to 52% at 1% False Positive Rate (FPR), while preserving downstream model utility.

Abstract

Large language models (LLMs) can be misused to produce harmful content, such as weapon-making instructions or malware. LLM providers rely on monitoring to detect and flag unsafe behavior during inference. An open security challenge is posed by adaptive adversaries, who craft attacks that simultaneously (i) evade detection and (ii) elicit unsafe behavior.

Adaptive attackers are a major concern because LLM providers cannot patch their security mechanisms against misuse they are unaware of. We cast robust LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates.

Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through activation watermarking by carefully introducing uncertainty for the attacker during inference. We find that activation watermarking outperforms guard baselines by up to 52% under adaptive attackers who know the monitoring algorithm but not the secret key.

Threat Model & Adaptive Attackers

We consider a provider deploying an LLM $\mathcal{M}$ together with a detector $D_k(\pi, x) \in \{0,1\}$, parameterized by a secret key $k$ that is not revealed to users.

An adaptive attacker knows the monitoring algorithm and all system details except for the secret key $k$. The attacker instantiates a surrogate model $\mathcal{M}_{k'}$ with a chosen key $k'$, and constructs queries that elicit harmful outputs while remaining undetected on this surrogate. The attacker's goal is to construct a sequence of prompts $\pi_1, \dots, \pi_N$ such that at least one response $x_t \in \mathcal{H}$ (harmful) is elicited while remaining undetected across all queries, i.e., $D_k(\pi_t, x_t) = 0$ for every $t$.

For optimization-based attacks, the attacker trains a prompt-rewriting model $\rho_\phi$ on the surrogate. The reward is defined as:

$$r(\pi', x) = \mathbf{1}\{ x \in \mathcal{H} \} \cdot \mathbf{1}\{ D_{k'}(\pi', x) = 0 \}$$
which equals 1 only when the response is harmful and undetected. Over multiple rounds, this yields a rewriter that generates queries which both elicit harmful behavior and evade the surrogate's detector.
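The reward above is a product of two indicators, which can be sketched in a few lines. This is only an illustration of the formula: `is_harmful` and `surrogate_detector` are hypothetical callables standing in for the harmfulness judge and the surrogate detector $D_{k'}$, and the toy stubs at the bottom are placeholders, not components from the paper.

```python
from typing import Callable


def surrogate_reward(
    prompt: str,
    response: str,
    is_harmful: Callable[[str], bool],            # hypothetical judge for x in H
    surrogate_detector: Callable[[str, str], int],  # hypothetical D_{k'}: 1 = alert
) -> int:
    """Return 1 only if the response is harmful AND the surrogate raises no alert."""
    harmful = 1 if is_harmful(response) else 0
    undetected = 1 if surrogate_detector(prompt, response) == 0 else 0
    return harmful * undetected


# Toy stand-ins (assumptions for illustration only).
def toy_judge(response: str) -> bool:
    return "UNSAFE" in response


def toy_detector(prompt: str, response: str) -> int:
    return int("flag" in prompt)


if __name__ == "__main__":
    print(surrogate_reward("plain prompt", "UNSAFE text", toy_judge, toy_detector))
```

Because the reward is binary, it is zero whenever either condition fails, so a rewriter gets no credit for evading the detector with a benign response.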

Methodology: Activation Watermarking

We propose activation watermarking, which samples a random secret direction in activation space, then fine-tunes the model so that harmful outputs align with this direction while benign outputs do not.

Key Generation & Similarity: For target layers $L$, we sample a Gaussian vector $w_\ell \sim \mathcal{N}(0, I_d)$ using a secret seed $k$. We then measure the cosine similarity between the hidden states $h^{\ell}_t \in \mathbb{R}^d$ and this secret direction:

$$c^{\ell}_t = \frac{\langle h^{\ell}_t, w_\ell \rangle}{\|h^{\ell}_t\|_2 \, \|w_\ell\|_2}$$
which is aggregated across layers as $c_t = \sum_{\ell \in L} c^{\ell}_t$.
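The key generation and similarity computation can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names, the use of `numpy.random.default_rng` as the seeded sampler, and the small `eps` guard against zero-norm hidden states are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def sample_directions(seed: int, layers: list[int], d: int) -> dict[int, np.ndarray]:
    """Sample one Gaussian direction w_l ~ N(0, I_d) per target layer from a secret seed."""
    rng = np.random.default_rng(seed)
    return {layer: rng.standard_normal(d) for layer in layers}


def layer_cosine(hidden: np.ndarray, w: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between each row of hidden (shape (T, d)) and w (shape (d,))."""
    norms = np.maximum(np.linalg.norm(hidden, axis=1) * np.linalg.norm(w), eps)
    return (hidden @ w) / norms


def token_scores(
    hidden_by_layer: dict[int, np.ndarray], directions: dict[int, np.ndarray]
) -> np.ndarray:
    """Aggregate per-layer cosines into c_t by summing over target layers."""
    return sum(layer_cosine(hidden_by_layer[l], w) for l, w in directions.items())


if __name__ == "__main__":
    dirs = sample_directions(seed=0, layers=[3, 7], d=16)  # hypothetical layers and width
    rng = np.random.default_rng(1)
    hidden = {l: rng.standard_normal((5, 16)) for l in dirs}
    print(token_scores(hidden, dirs))  # one score per token, each in [-2, 2]
```

Since each per-layer cosine lies in $[-1, 1]$, the summed score $c_t$ lies in $[-|L|, |L|]$.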

Loss Function: We apply the watermark only to tokens after a harmful onset offset $\Delta$, marking where the content becomes policy-violating. A linear weight $w_t^{\text{lin}}$ ramps from 0 to 1, concentrating the signal on explicitly harmful tokens. The loss function is:

$$\mathcal{L}(x) = \begin{cases} \sum_{t} \left[ \mathrm{KL}_t - \lambda \, w_t^{\text{lin}} c_t \right], & \text{if } x \in \mathcal{H} \\ \sum_{t} \left[ \mathrm{KL}_t + \lambda \, w_t^{\text{lin}} c_t \right], & \text{otherwise} \end{cases}$$

Minimizing $-c_t$ aligns harmful activations with $w_\ell$, while minimizing $+c_t$ pushes benign activations away. The KL term prevents the model from deviating significantly from the base model.
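To make the sign convention concrete, the per-sequence loss can be sketched on precomputed per-token arrays. The exact shape of the ramp is not given here, so `linear_ramp` is an assumption (zero up to the onset, then rising linearly to 1 at the last token). The function names and the example values are likewise illustrative, and a real training loop would use a differentiable framework rather than NumPy.

```python
import numpy as np


def linear_ramp(num_tokens: int, onset: int) -> np.ndarray:
    """Assumed ramp: weight 0 before the onset offset, rising linearly to 1 at the last token."""
    weights = np.zeros(num_tokens)
    n_after = num_tokens - onset
    if n_after > 0:
        weights[onset:] = np.arange(1, n_after + 1) / n_after
    return weights


def watermark_loss(
    kl: np.ndarray,       # per-token KL to the base model, shape (T,)
    c: np.ndarray,        # per-token aggregated cosine c_t, shape (T,)
    weights: np.ndarray,  # per-token weights w_t^lin, shape (T,)
    harmful: bool,
    lam: float,
) -> float:
    """Sum over tokens of KL_t -/+ lam * w_t * c_t (minus if harmful, plus otherwise)."""
    sign = -1.0 if harmful else 1.0
    return float(np.sum(kl + sign * lam * weights * c))


if __name__ == "__main__":
    w = linear_ramp(num_tokens=4, onset=2)  # [0, 0, 0.5, 1]
    kl = np.zeros(4)
    c = np.ones(4)
    print(watermark_loss(kl, c, w, harmful=True, lam=2.0))   # alignment lowers the loss
    print(watermark_loss(kl, c, w, harmful=False, lam=2.0))  # alignment raises the loss
```

With identical activations, the harmful branch is rewarded for alignment with the secret direction and the benign branch is penalized for it, which is what separates the two classes at detection time.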

Detection: At inference, an alert fires when the averaged similarity satisfies $T_k(\pi, x) > \tau_k$, where $\tau_k$ is a threshold calibrated on benign data.
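One way to realize this decision rule is sketched below. It makes two assumptions that are not spelled out above: the statistic is taken as the mean of the per-token scores, and the threshold is set to the empirical $(1 - \text{FPR})$ quantile of benign scores. All names are hypothetical.

```python
import numpy as np


def detection_statistic(token_scores: np.ndarray) -> float:
    """Assumed statistic T_k: the mean of per-token similarity scores."""
    return float(np.mean(token_scores))


def calibrate_threshold(benign_statistics: np.ndarray, target_fpr: float = 0.01) -> float:
    """Pick tau so that roughly target_fpr of benign samples score above it."""
    return float(np.quantile(benign_statistics, 1.0 - target_fpr))


def detect(statistic: float, tau: float) -> int:
    """Return 1 (alert) when the statistic exceeds the calibrated threshold."""
    return int(statistic > tau)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    benign = rng.normal(loc=-0.1, scale=0.05, size=5000)  # synthetic benign statistics
    tau = calibrate_threshold(benign, target_fpr=0.01)
    print("tau:", tau)
    print("alert on a high score:", detect(0.4, tau))
```

Calibrating on benign data only means the operating point is fixed by the tolerated false positive rate, independent of any particular attack.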

Quantitative Evaluation

We evaluate our activation-watermarked models (based on Qwen2.5) against standard guard models (LlamaGuard-3-8B and Qwen3Guard-Gen-8B) across multiple template-based and optimization-based jailbreak attack families (Jailbroken, DeepInception, Multilingual, and AutoDAN).

1. Robustness Against Adaptive Jailbreaks

ASR (Attack Success Rate) at 1% False Positive Rate (FPR) measures the fraction of harmful responses that evade detection. Lower ASR and higher AUROC indicate superior detection performance.
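Both metrics can be computed from detector scores alone. The sketch below assumes scores where higher means more suspicious, sets the threshold at the empirical benign quantile, and computes AUROC via the pairwise (Mann-Whitney) formulation. The function names and synthetic inputs are illustrative and not the paper's evaluation code.

```python
import numpy as np


def asr_at_fpr(harmful_scores, benign_scores, target_fpr: float = 0.01) -> float:
    """Fraction of harmful responses scoring at or below the benign-calibrated threshold."""
    harmful_scores = np.asarray(harmful_scores, dtype=float)
    benign_scores = np.asarray(benign_scores, dtype=float)
    tau = np.quantile(benign_scores, 1.0 - target_fpr)
    return float(np.mean(harmful_scores <= tau))


def auroc(harmful_scores, benign_scores) -> float:
    """P(harmful score > benign score), counting ties as one half."""
    harmful_scores = np.asarray(harmful_scores, dtype=float)
    benign_scores = np.asarray(benign_scores, dtype=float)
    diff = harmful_scores[:, None] - benign_scores[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    benign = rng.normal(0.0, 1.0, size=2000)   # synthetic scores
    harmful = rng.normal(2.0, 1.0, size=500)   # synthetic scores
    print("ASR @ 1% FPR:", asr_at_fpr(harmful, benign))
    print("AUROC:", auroc(harmful, benign))
```

The pairwise AUROC uses memory proportional to the product of the two sample sizes, which is fine for evaluation sets of a few thousand items.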

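| Attack Dataset | ASR @ 1% FPR ↓: Base | ASR: LlamaGuard | ASR: QwenGuard | ASR: ActProbe | ASR: ActWM (Ours) | AUROC ↑: LlamaGuard | AUROC: QwenGuard | AUROC: ActProbe | AUROC: ActWM (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| BeaverTails | 0.6420 | 0.3260 | 0.1960 | 0.2729 | 0.3320 | 0.7454 | 0.8489 | 0.8594 | 0.8779 |
| Jailbroken | 1.0000 | 0.6429 | 0.6122 | 0.3296 | 0.4592 | 0.7233 | 0.9261 | 0.9653 | 0.9292 |
| DeepInception | 1.0000 | 0.8511 | 0.7660 | 0.8644 | 0.6702 | 0.8212 | 0.8697 | 0.8711 | 0.9229 |
| Multilingual | 1.0000 | 0.6585 | 0.3902 | 0.6214 | 0.3415 | 0.4353 | 0.9227 | 0.9484 | 0.9619 |
| AutoDAN | 1.0000 | 0.8988 | 0.8750 | 0.7532 | 0.6786 | 0.4092 | 0.7851 | 0.8746 | 0.9048 |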

2. Impact on Downstream Utility (Alignment Tax)

We measure utility on standard benchmarks to check whether fine-tuning for activation watermarking degrades model capability. The absolute change is minor on general tasks (under 2 percentage points on BBH, IFEval, MMLU-Pro, and TruthfulQA), while the largest drops occur on math: 7.1 points on GSM8K and 2.6 points on MATH-Hard.

| Benchmark Dataset | Base Model | Watermarked Model (Ours) | Difference |
|---|---|---|---|
| BBH (Big-Bench Hard) | 0.5381 | 0.5476 | +0.0095 ↑ |
| IFEval (Instruction Following) | 0.6000 | 0.5804 | -0.0196 ↓ |
| MMLU-Pro | 0.4276 | 0.4417 | +0.0141 ↑ |
| TruthfulQA | 0.6482 | 0.6423 | -0.0059 ↓ |
| GSM8K (Math) | 0.8423 | 0.7710 | -0.0713 ↓ |
| MATH-Hard | 0.2243 | 0.1979 | -0.0264 ↓ |

3. Scalability Across Model Sizes

We compare activation watermarking across model sizes (7B and 14B Qwen2.5 base models) under the same adaptive attacks. AUROC stays above 0.88 for every attack family at both scales. The 14B model is stronger on DeepInception (0.9840 vs. 0.9082) and marginally on AutoDAN, while the 7B model is stronger on Jailbroken and Multilingual.

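| Model Config | Utility (IFEval) | Jailbroken AUROC | DeepInception AUROC | Multilingual AUROC | AutoDAN AUROC |
|---|---|---|---|---|---|
| 14B Base Model | 0.8244 | - | - | - | - |
| 7B ActWM (Ours) | 0.5804 | 0.9329 | 0.9082 | 0.9541 | 0.8866 |
| 14B ActWM (Ours) | 0.8194 | 0.9146 | 0.9840 | 0.9370 | 0.8905 |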

4. Robustness to Transfer Attacks (Surrogate Models)

To test the limits of security, we train adaptive jailbreaks on a completely different model family (Mistral-7B-Instruct) and evaluate them directly against our Qwen2.5-7B watermarked model. The results indicate that cross-model transfer is substantially impaired: ASR at 1% FPR falls below 5% for three of the four attack families, with Multilingual the exception at 13.2%.

| Metric Evaluated on Qwen | Jailbroken | DeepInception | Multilingual | AutoDAN |
|---|---|---|---|---|
| AUROC ↑ | 0.9501 | 0.9545 | 0.8972 | 0.9319 |
| ASR @ 1% FPR ↓ | 0.0460 (4.6%) | 0.0024 (0.24%) | 0.1321 (13.2%) | 0.0050 (0.50%) |

Key Insights & Discussion

Token Weighting Matters: In our ablation studies, we compared linear token weighting against uniform weighting. Concentrating the watermark signal on later tokens in a response (using a linear ramp after the estimated onset offset $\Delta$) consistently achieves a higher AUROC than uniform weighting. This aligns with the qualitative observation that harmful text becomes progressively more explicit as a response unfolds, so focusing the loss budget on later tokens leaves the earlier conversational scaffolding largely untouched.

The Key-Mismatch Barrier: The security of activation watermarking rests on the secrecy of the key $k$. Our key-dependence evaluations show that prompts optimized to evade detector $D_{k_j}$ (surrogate key) transfer poorly to detector $D_{k_i}$ (true key). This indicates that key-dependent uncertainty prevents optimization-based attackers from reliably engineering bypass prompts, which makes the monitor more robust.
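Some geometric intuition for the key-mismatch effect: independent Gaussian directions in high dimension are nearly orthogonal, with cosine similarity concentrating around 0 at a scale of roughly $1/\sqrt{d}$. The sketch below checks this numerically. It does not reproduce the key-dependence evaluation and says nothing about how a fine-tuned model behaves under different keys. The dimension, seeds, and function names are hypothetical.

```python
import numpy as np


def direction_from_key(key: int, d: int) -> np.ndarray:
    """Derive a Gaussian direction from an integer key (illustrative key derivation)."""
    return np.random.default_rng(key).standard_normal(d)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


d = 4096                                # hypothetical hidden size
w_true = direction_from_key(1234, d)    # stands in for the provider's secret direction

# Directions an attacker might draw under 200 other keys.
cosines = np.array([cosine(w_true, direction_from_key(k, d)) for k in range(200)])

print(f"mean |cos| over surrogate keys: {np.abs(cosines).mean():.4f}")
print(f"reference scale 1/sqrt(d):      {1 / np.sqrt(d):.4f}")
```

Under this picture, a direction drawn from a surrogate key carries almost no information about the true direction, which is consistent with the poor transfer reported above.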