
Robust Safety Monitoring of Language Models via Activation Watermarking
Authors
Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas
Affiliations
MBZUAI, UAE
TL;DR: Existing external LLM monitors (e.g., LlamaGuard) are highly vulnerable to adaptive attackers who craft prompts that bypass keyword filters while eliciting safety policy violations. We introduce activation watermarking—a technique that embeds a randomized secret direction in target hidden states during model tuning. This internal defense reduces adaptive Attack Success Rate (ASR) by up to 52% at 1% False Positive Rate (FPR), while preserving downstream model utility.
Abstract
Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on monitoring to detect and flag unsafe behavior during inference. An open security challenge is adaptive adversaries who craft attacks that simultaneously (i) evade detection while (ii) eliciting unsafe behavior.
Adaptive attackers are a major concern as LLM providers cannot patch their security mechanisms, since they are unaware of how their models are being misused. We cast robust LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates.
Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through activation watermarking by carefully introducing uncertainty for the attacker during inference. We find that activation watermarking outperforms guard baselines by up to 52% under adaptive attackers who know the monitoring algorithm but not the secret key.
Threat Model & Adaptive Attackers
We consider a provider deploying an LLM that deploys a detector parameterized by a secret key that is not revealed to users.
An adaptive attacker knows the monitoring algorithm and all system details except for the secret key . The attacker instantiates a surrogate model with a chosen key , and constructs queries that elicit harmful outputs while remaining undetected on this surrogate. The attacker's goal is to construct a sequence of prompts such that at least one response (harmful) is elicited while remaining undetected across all queries, i.e., .
For optimization-based attacks, the attacker trains a prompt-rewriting model on the surrogate. The reward is defined as:
Methodology: Activation Watermarking
We propose activation watermarking, which samples a random secret direction in activation space, then fine-tunes the model so that harmful outputs align with this direction while benign outputs do not.
Key Generation & Similarity: For target layers , we sample a Gaussian vector using a secret seed . We then measure the cosine similarity between the hidden states and this secret direction:
Loss Function: We apply the watermark only to tokens after a harmful onset offset , marking where the content becomes policy-violating. A linear weight ramps from 0 to 1, concentrating the signal on explicitly harmful tokens. The loss function is:
Minimizing aligns harmful activations with , while minimizing pushes benign activations away. The KL term prevents the model from deviating significantly from the base model.
Detection: At inference, an alert fires when the averaged similarity , where is a threshold calibrated on benign data.
Quantitative Evaluation
We evaluate our activation watermarked models (based on Qwen2.5) against standard guard models—namely, LlamaGuard-3-8B and Qwen3Guard-Gen-8B—across multiple template-based and optimization-based jailbreak attack families (Jailbroken, DeepInception, Multilingual, and AutoDAN).
1. Robustness Against Adaptive Jailbreaks
ASR (Attack Success Rate) at 1% False Positive Rate (FPR) measures the fraction of harmful responses that evade detection. Lower ASR and higher AUROC indicate superior detection performance.
| Attack Dataset | ASR @ 1% FPR ↓ | AUROC ↑ | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Base | LlamaGuard | QwenGuard | ActProbe | ActWM (Ours) | LlamaGuard | QwenGuard | ActProbe | ActWM (Ours) | |
| BeaverTails | 0.6420 | 0.3260 | 0.1960 | 0.2729 | 0.3320 | 0.7454 | 0.8489 | 0.8594 | 0.8779 |
| Jailbroken | 1.0000 | 0.6429 | 0.6122 | 0.3296 | 0.4592 | 0.7233 | 0.9261 | 0.9653 | 0.9292 |
| DeepInception | 1.0000 | 0.8511 | 0.7660 | 0.8644 | 0.6702 | 0.8212 | 0.8697 | 0.8711 | 0.9229 |
| Multilingual | 1.0000 | 0.6585 | 0.3902 | 0.6214 | 0.3415 | 0.4353 | 0.9227 | 0.9484 | 0.9619 |
| AutoDAN | 1.0000 | 0.8988 | 0.8750 | 0.7532 | 0.6786 | 0.4092 | 0.7851 | 0.8746 | 0.9048 |
2. Impact on Downstream Utility (Alignment Tax)
We measure utility on standard datasets to verify whether fine-tuning for activation watermarking degrades model intelligence. As shown, the absolute change is minor (mostly under 2 percentage points) on general tasks, with the biggest drop concentrated on complex math reasoning.
| Benchmark Dataset | Base Model | Watermarked Model (Ours) | Difference |
|---|---|---|---|
| BBH (Big-Bench Hard) | 0.5381 | 0.5476 | +0.0095 ↑ |
| IFEval (Instruction Following) | 0.6000 | 0.5804 | -0.0196 ↓ |
| MMLU (pro) | 0.4276 | 0.4417 | +0.0141 ↑ |
| TruthfulQA | 0.6482 | 0.6423 | -0.0059 ↓ |
| GSM8K (Math) | 0.8423 | 0.7710 | -0.0713 ↓ |
| MATH-Hard | 0.2243 | 0.1979 | -0.0264 ↓ |
3. Scalability Across Model Sizes
We compare activation watermarking across model sizes (7B and 14B Qwen2.5 base models) under the same adaptive attacks. Robustness remains high across scales, though smaller and larger models show slight differences in strengths on specific attack types.
| Model Config | Utility (IFEval) | Jailbroken AUROC | DeepInception AUROC | Multilingual AUROC | AutoDAN AUROC |
|---|---|---|---|---|---|
| 14B Base Model | 0.8244 | — | — | — | — |
| 7B ActWM (Ours) | 0.5804 | 0.9329 | 0.9082 | 0.9541 | 0.8866 |
| 14B ActWM (Ours) | 0.8194 | 0.9146 | 0.9840 | 0.9370 | 0.8905 |
4. Robustness to Transfer Attacks (Surrogate Models)
To test the limits of security, we train adaptive jailbreaks on a completely different model family (Mistral-7B-Instruct) and evaluate them directly against our Qwen2.5-7B watermarked model. The results prove that cross-model transfer is significantly impaired, with ASR dropping to near-zero.
| Metric Evaluated on Qwen | Jailbroken | DeepInception | Multilingual | AutoDAN |
|---|---|---|---|---|
| AUROC ↑ | 0.9501 | 0.9545 | 0.8972 | 0.9319 |
| ASR @ 1% FPR ↓ | 0.0460 (4.6%) | 0.0024 (0.24%) | 0.1321 (13.2%) | 0.0050 (0.50%) |
Key Insights & Discussion
Token Weighting Matters: In our ablation studies, we compared linear token weighting against uniform weighting. Concentrating the watermark signal on later tokens in a response (using a linear ramp after the estimated onset offset ) consistently achieves a higher AUROC. This aligns with the qualitative observation that harmful text becomes progressively more explicit over length, so focusing the loss budget on those tokens preserves the earlier conversational scaffolding.
The Key-Mismatch Barrier: The security guarantees of activation watermarking rest on the secrecy of key . Our key-dependence evaluations show that prompts optimized to evade detector (surrogate key) transfer poorly to detector (true key). This demonstrates that adding key-dependent uncertainty prevents optimization-based attackers from reliably engineering bypass strings, offering a robust monitoring mechanism.