Under Review, NeurIPS 2026

Watermarking Should Be Treated as a Monitoring Primitive

Authors

Toluwani Aremu, Nils Lukas, Jie Zhang

Affiliations

MBZUAI, UAE & A*STAR, Singapore

TL;DR: Watermarking is widely proposed for provenance and safety monitoring, but it carries a fundamental dual-use tension. Assigning distinct keys to different entities (multi-key zero-bit deployment) induces a persistent, key-dependent statistical structure in their outputs. An observer who passively aggregates outputs can therefore re-identify the source entity over time. With 16 entities, external observers without access to the watermark keys reach 73.0% (text) and 91.0% (image) top-1 re-identification accuracy, against a 6.25% random baseline.

Abstract

Watermarking is widely proposed for provenance, attribution, and safety monitoring in generative models, yet is typically evaluated only under adversaries who attempt to evade detection or induce false positives at the level of individual samples. We argue that watermarking should be treated as a monitoring primitive, and that internal monitoring is unavoidable given per-entity attribution keys and messages, as well as detector access.

We introduce an observer-based threat model in which observers can aggregate watermark signals across outputs to infer entity-level information, showing that even zero-bit watermarking enables attribution under multi-key settings. We further show that external monitoring can emerge over time from persistent, key-dependent statistical structure, although this depends on watermark design and may be mitigated by distribution-preserving or undetectable schemes.

Our findings reveal a fundamental dual-use tension between attribution and monitoring, motivating evaluation of watermarking beyond per-sample robustness to account for aggregation and observer-based capabilities.

Observer-Based Threat Model

Unlike prior work that considers adversaries manipulating individual outputs, we study observers that passively aggregate signals across multiple outputs.

Let $\mathcal{M}$ denote a generative model that maps a prompt $p \in \mathcal{P}$ to an output $x \in \mathcal{X}$, where:

$$x \sim \mathcal{M}(\cdot \mid p)$$

A watermarking scheme modifies generation using a secret key $k \in \mathcal{K}$ to produce watermarked outputs. Given an output $x$, a detector $\mathcal{D}_k(x)$ produces a score indicating the presence of a watermark.

We consider a set of entities $\mathcal{E} = \{e_1, \dots, e_n\}$ interacting with the model over time. An observer $\mathcal{O}$ passively observes these outputs to infer entity-level information.

Internal Observer

The internal observer has access to watermark detectors and keys. Under multi-key deployments, each entity has a distinct key $k_e$, allowing the observer to evaluate $\mathcal{D}_{k_e}(x)$ and directly attribute outputs.
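A minimal sketch of internal attribution under a multi-key deployment, assuming a toy green-list detector in the spirit of KGW. The token vocabulary, the hash-based `is_green` split, the resampling generator, and the threshold `tau=4.0` are all hypothetical stand-ins rather than the paper's implementation. The point is only that an observer holding every $k_e$ can score an output under each key and attribute it to the best-scoring entity.

```python
import hashlib
import math
import random
from typing import Dict, List, Optional

GAMMA = 0.5   # assumed fraction of tokens marked "green" at each step
VOCAB = 1000  # toy vocabulary size (hypothetical)


def is_green(key: str, prev_tok: int, tok: int) -> bool:
    """Keyed pseudo-random green/red split, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_tok}|{tok}".encode()).digest()
    return digest[0] < 256 * GAMMA


def generate(key: str, length: int, rng: random.Random) -> List[int]:
    """Toy stand-in for watermarked sampling: redraw a few times to favor green tokens."""
    toks = [rng.randrange(VOCAB)]
    while len(toks) < length:
        tok = rng.randrange(VOCAB)
        for _ in range(3):
            if is_green(key, toks[-1], tok):
                break
            tok = rng.randrange(VOCAB)
        toks.append(tok)
    return toks


def z_score(key: str, toks: List[int]) -> float:
    """One-proportion z-score of the green-token count under the given key."""
    t = len(toks) - 1
    green = sum(is_green(key, a, b) for a, b in zip(toks, toks[1:]))
    return (green - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))


def attribute(toks: List[int], keys: Dict[str, str], tau: float = 4.0) -> Optional[str]:
    """Score the output under every entity key; return the best entity if it clears tau."""
    scores = {entity: z_score(k, toks) for entity, k in keys.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > tau else None


if __name__ == "__main__":
    rng = random.Random(1)
    keys = {f"e{i}": f"secret-key-{i}" for i in range(1, 5)}  # hypothetical per-entity keys
    output = generate(keys["e3"], 200, rng)
    print("attributed to:", attribute(output, keys))
```

Because the detector is evaluated once per key, the cost of attribution grows linearly with the number of entities in this sketch.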

External Observer

The external observer does not have access to watermark keys. It relies on observable outputs and applies statistical or learned methods to extract signals and aggregate weak evidence across samples.

Watermarking as a Monitoring Primitive

Our key observation is that watermarking introduces a persistent, detectable signal into generated content, which can be aggregated across outputs to support entity-level inference over time.

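Formally, given a prompt $p$, the watermarked model generates:

$$x \sim \mathcal{E}_k(\mathcal{M}(\cdot \mid p))$$

Here $\mathcal{E}_k$ is read as the watermark embedding under key $k$, not to be confused with the entity set $\mathcal{E}$.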

The detector computes a statistic

$$s = \mathcal{D}_k(x)$$

and determines whether $x$ is watermarked via a hypothesis test

$$s \gtrless \tau$$
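As one concrete instance of the statistic $s$, green-list schemes in the style of KGW (the text watermark evaluated below) typically use a one-proportion z-score. The symbols $T$ (number of scored tokens), $\gamma$ (expected green fraction), and $|x|_G$ (green tokens counted under key $k$) are notation introduced here for illustration and do not appear in the source:

```latex
s = \mathcal{D}_k(x) = \frac{|x|_G - \gamma T}{\sqrt{T\,\gamma\,(1-\gamma)}},
\qquad
\text{declare } x \text{ watermarked if } s > \tau .
```

Under the null hypothesis that $x$ carries no watermark for key $k$, $s$ is approximately standard normal, so a threshold such as $\tau = 4$ corresponds to a one-sided false-positive rate of roughly $3 \times 10^{-5}$ per tested output.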

In zero-bit watermarking, no explicit identity is encoded. However, under multi-key deployments, each entity $e \in \mathcal{E}$ is associated with a distinct key $k_e$, inducing a key-dependent distribution:

$$x \sim \mathcal{E}_{k_e}(\mathcal{M}(\cdot \mid p))$$

Given labeled outputs $\{(x_i, e_i)\}_{i=1}^N$, an external observer can train a classifier $f : \phi(x) \mapsto \hat{e}$ over features $\phi(x)$ extracted from each output and predict $\hat{e}(x) = f(\phi(x))$. If watermarking induces persistent key-dependent structure, the observer can identify the most likely source without access to the watermark keys.
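The external-observer pipeline can be sketched on synthetic data. This toy example is not the paper's setup: it replaces $\phi$ with random feature vectors whose mean shifts slightly per entity (standing in for key-dependent structure), replaces $f$ with a nearest-centroid rule, and assumes the observer can pool a batch of outputs known to share one source. All function names, dimensions, and noise levels are hypothetical.

```python
import random
from typing import List, Tuple


def make_entities(n: int, dim: int, rng: random.Random, shift: float = 0.3) -> List[List[float]]:
    """Assumed key-dependent structure: a small per-entity mean shift in feature space."""
    return [[rng.gauss(0.0, shift) for _ in range(dim)] for _ in range(n)]


def sample_features(mean: List[float], rng: random.Random) -> List[float]:
    """phi(x) for one output: the entity's shift buried in unit-variance noise."""
    return [m + rng.gauss(0.0, 1.0) for m in mean]


def fit_centroids(labeled: List[Tuple[List[float], int]], n: int, dim: int) -> List[List[float]]:
    """Train f: estimate one centroid per entity from labeled (phi(x_i), e_i) pairs."""
    sums = [[0.0] * dim for _ in range(n)]
    counts = [0] * n
    for phi, e in labeled:
        counts[e] += 1
        for j, v in enumerate(phi):
            sums[e][j] += v
    return [[s / counts[e] for s in sums[e]] for e in range(n)]


def predict(centroids: List[List[float]], feats: List[List[float]]) -> int:
    """Average a batch of same-source features, then return the nearest centroid's index."""
    dim = len(centroids[0])
    avg = [sum(f[j] for f in feats) / len(feats) for j in range(dim)]
    dists = [sum((a - c) ** 2 for a, c in zip(avg, cen)) for cen in centroids]
    return dists.index(min(dists))


def accuracy(centroids, means, batch: int, trials: int, rng: random.Random) -> float:
    """Top-1 re-identification accuracy when `batch` outputs are aggregated per guess."""
    hits = 0
    for _ in range(trials):
        e = rng.randrange(len(means))
        feats = [sample_features(means[e], rng) for _ in range(batch)]
        hits += predict(centroids, feats) == e
    return hits / trials


rng = random.Random(0)
N, DIM = 4, 16
means = make_entities(N, DIM, rng)
train = [(sample_features(means[e], rng), e) for e in range(N) for _ in range(500)]
centroids = fit_centroids(train, N, DIM)
single_acc = accuracy(centroids, means, batch=1, trials=400, rng=rng)
agg_acc = accuracy(centroids, means, batch=50, trials=400, rng=rng)
print(f"single-output top-1: {single_acc:.2f}, 50-output aggregate top-1: {agg_acc:.2f}")
```

Averaging features before the nearest-centroid step is equivalent to summing per-output squared distances, so weak per-sample evidence accumulates as the batch grows. On this toy data the aggregated accuracy should be clearly higher than the single-output accuracy.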

Quantitative Evaluation

We evaluate internal and external observer capabilities across text models (Qwen2.5-14B with KGW watermarking on C4) and image models (Stable Diffusion v2.1 with Tree-Ring watermarking).

1. External Observer Re-Identification Accuracy

Without access to watermark keys, an external observer trains a classifier (BERT-Base for text, CLIP-RN50 for images) on public outputs. Re-identification accuracy is initially near random, but improves substantially as more samples are aggregated.

Modality (Method)	Entities ($n$)	Top-1 (100 Samples)	Top-1 (4000 Samples)	Top-3 (4000 Samples)	Random Baseline ($1/n$)
Text (KGW)	n = 2	~52.0%	~96.0%	100.0%	50.0%
	n = 4	~28.0%	~88.0%	~97.0%	25.0%
	n = 8	~15.0%	~80.0%	~93.0%	12.5%
	n = 16	11.3%	73.0%	90.0%	6.25%
Images (Tree-Ring)	n = 2	~55.0%	~98.0%	100.0%	50.0%
	n = 4	~32.0%	~94.0%	~99.0%	25.0%
	n = 8	~18.0%	~92.0%	~97.0%	12.5%
	n = 16	~12.0%	91.0%	96.0%	6.25%
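To put the 16-entity results in context, accuracy can be expressed as a multiple of the $1/n$ random baseline. The short calculation below uses only values from the table; the helper name `lift_over_chance` is illustrative.

```python
def lift_over_chance(top1_percent: float, n_entities: int) -> float:
    """Ratio of observed top-1 accuracy to the 1/n random-guess baseline."""
    baseline_percent = 100.0 / n_entities
    return top1_percent / baseline_percent


# Top-1 accuracy at 4000 samples with n = 16 entities, taken from the table.
for label, acc in [("Text (KGW)", 73.0), ("Images (Tree-Ring)", 91.0)]:
    print(f"{label}: {lift_over_chance(acc, 16):.2f}x chance")
```

At $n = 16$ this works out to roughly 11.7 times chance for text and 14.6 times chance for images.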

2. Isolating Watermarking Roles (Control Settings for $n=16$)

We compare four settings at 4000 samples to confirm that monitoring ability emerges specifically from the key-dependent watermark structure rather than from model or prompt biases. Without watermarking, or under a shared-key setup, external observer accuracy falls to roughly the 6.25% random baseline (about 6 to 7% top-1).

Deployment Setting	Text Modality (KGW)	Image Modality (Tree-Ring)	Random Guess Baseline
Internal Observer (Key Access)	~98.0% TPR	~99.0% TPR	6.25%
External Observer (BERT/CLIP)	73.0% Top-1	91.0% Top-1	6.25%
No-Watermark Baseline	~7.0% Top-1	~6.0% Top-1	6.25%
Shared-Key Deployment	~6.0% Top-1	~6.0% Top-1	6.25%

Key Insights & Discussion

The Dual-Use Tension: Provenance technologies like watermarking are typically marketed as security and transparency mechanisms. However, our findings show that assigning distinct keys to individual entities can leave a persistent, entity-specific statistical trace, as observed here for KGW and Tree-Ring. A passive observer (such as an ISP, database auditor, or government entity) can aggregate outputs to track, profile, and identify specific users over time.

Mitigation Directions: One way to reduce this risk is to use watermarking methods that preserve the model's original output distribution, making the watermark statistically undetectable without key access. Alternatively, shared-key deployments defend against passive external monitoring, although they eliminate the provider's ability to attribute content to individual entities.
