Security & Robustness | CVPR 2026 (AdvML) [Distinguished Paper 🏆]

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Authors

Samar Fares*, Klea Ziu*, Toluwani Aremu*, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

Affiliations

Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), NVIDIA, EPFL, Michigan State University

* Equal contribution

TL;DR: Vision-Language Models (VLMs) are highly vulnerable to adversarial image perturbations. We present MirrorCheck, a model-agnostic defense that uses Text-to-Image (T2I) models to regenerate visual content from the VLM's own captions and measures semantic consistency between the input and the regenerated image in feature space. To thwart adaptive attacks, we introduce stochastic encoder/generator selection and One-Time-Use (OTU) parameter perturbations, maintaining detection rates up to 99%.

Abstract

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings.

MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.

Threat Model

Our framework is designed to detect adversarial attacks, irrespective of the attacker's level of knowledge. In this scenario, there are two parties:

Attacker: The attacker's goal is to generate an adversarial image $x_{\text{adv}} = x_{\text{clean}} + \delta$ that causes the victim model to produce an incorrect caption or classification. The attack can be targeted, where the generated text matches a predefined adversarial target, or untargeted, where the model is simply forced to misinterpret or misdescribe the input image. In both cases, the perturbation $\delta$ is constrained within an $\ell_\infty$ or $\ell_2$ bounded adversarial budget. The adversary may have full white-box access or operate in a black-box setting.

Defender: The defender aims to correctly classify input images as either clean or adversarial by assessing the consistency between the model's interpretation of the input and a reference image generated from the model's textual output. The defender only assumes black-box access to the victim model and has no access to any ground-truth clean reference image.
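
The attacker's budget constraint on $\delta$ can be illustrated with a short sketch. The helper names `project_linf` and `project_l2`, the budget values, and the flat-list image representation are assumptions made for illustration; they are not taken from the MirrorCheck code.

```python
import math

def project_linf(delta, eps):
    """Clip every coordinate of delta into [-eps, eps] (the l_inf ball)."""
    return [max(-eps, min(eps, d)) for d in delta]

def project_l2(delta, eps):
    """Rescale delta so that its Euclidean norm is at most eps (the l_2 ball)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]

# Hypothetical perturbation and budgets, chosen only for illustration.
delta = [0.3, -0.4, 0.0]
delta_linf = project_linf(delta, 0.1)   # each entry clipped to +/- 0.1
delta_l2 = project_l2(delta, 0.25)      # norm 0.5 shrunk to 0.25
```

An attack loop would typically apply one of these projections after every update step so that $x_{\text{adv}}$ stays within the allowed distance of $x_{\text{clean}}$.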

Methodology

Let $\mathcal{F}_\theta(x_{\text{in}}; p) \rightarrow t$ be the victim model (a VLM or a classification model), which produces a description text $t$ from input image $x_{\text{in}}$ under prompt $p$. Let $\mathcal{I}_{\phi}(x) \rightarrow z$ be a pretrained image encoder and let $G_{\psi}(t) \rightarrow x_{\text{gen}}$ denote a pretrained text-conditioned image generation (T2I) model.

The defender first obtains the output caption $t = \mathcal{F}_\theta(x_{\text{in}}; p)$, then synthesizes the reference image $x_{\text{gen}} = G_{\psi}(t)$. Semantic consistency is assessed by computing the cosine similarity between the feature embeddings:

$$c = \cos(\mathcal{I}_\phi(x_{\text{in}}), \mathcal{I}_\phi(x_{\text{gen}})) = \frac{\langle \mathcal{I}_\phi(x_{\text{in}}), \mathcal{I}_\phi(x_{\text{gen}}) \rangle}{\|\mathcal{I}_\phi(x_{\text{in}})\|_2 \, \|\mathcal{I}_\phi(x_{\text{gen}})\|_2}$$
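
The caption, regenerate, compare steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `caption_fn`, `generate_fn`, and `encode_fn` are hypothetical callables standing in for $\mathcal{F}_\theta$, $G_\psi$, and $\mathcal{I}_\phi$, and the 2-D toy "images", prototype table, and captioners exist only to make the example runnable.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def consistency_score(x_in, caption_fn, generate_fn, encode_fn, prompt=""):
    """c = cos(I(x_in), I(G(F(x_in; p)))) for hypothetical callables."""
    t = caption_fn(x_in, prompt)   # t = F(x_in; p)
    x_gen = generate_fn(t)         # x_gen = G(t)
    return cosine_similarity(encode_fn(x_in), encode_fn(x_gen))

# Toy stand-ins (assumptions): 2-D "images" and an identity encoder.
PROTOTYPES = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

def honest_captioner(x, prompt):
    return "cat" if x[0] >= x[1] else "dog"

def fooled_captioner(x, prompt):
    return "dog"  # mimics a victim model pushed to a wrong caption

def generator(caption):
    return PROTOTYPES[caption]

def encoder(x):
    return x

x = [0.9, 0.1]
c_clean = consistency_score(x, honest_captioner, generator, encoder)
c_attack = consistency_score(x, fooled_captioner, generator, encoder)
```

In the toy setup the correct caption regenerates something close to the input, so `c_clean` is near 1, while the wrong caption regenerates an unrelated prototype and `c_attack` is much lower. A detector would flag inputs whose score falls below some threshold; how that threshold is chosen is not specified here.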

Stochastic MirrorCheck & OTU Noise: To defend against adaptive white-box attackers who optimize perturbations to fool both the VLM and the detector, we introduce:

Stochastic Model Zoo selection: Randomly selecting a T2I generator (Stable Diffusion, UniDiffuser, ControlNet) and an image encoder (CLIP variants, OpenCLIP) for each query.
One-Time-Use (OTU) Perturbations: Adding freshly sampled Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 I)$ directly to the image encoder's weights before each query. Because the noise is never reused, every query sees a different random projection, which breaks gradient-based optimization attacks.
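
The two randomization mechanisms can be sketched with a toy linear encoder. Everything below is an illustrative assumption: uniform sampling from the zoo, the placeholder encoder names, the helper names `pick_models` and `otu_perturb`, and reading the noise scale quoted in the robustness section ($5\times10^{-4}$) as the standard deviation $\sigma$.

```python
import random

def pick_models(generators, encoders, rng):
    """Draw one generator and one encoder for this query (uniform: assumption)."""
    return rng.choice(generators), rng.choice(encoders)

def otu_perturb(weights, sigma, rng):
    """Return a noisy copy W + eta, with eta ~ N(0, sigma^2) per entry.

    The original weights are left untouched so the next query can draw
    fresh, independent noise.
    """
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]

def encode(weights, x):
    """Toy linear encoder z = W x, standing in for a real image encoder."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

rng = random.Random(0)
generators = ["Stable Diffusion", "UniDiffuser", "ControlNet"]
encoders = ["CLIP variant A", "CLIP variant B", "OpenCLIP"]  # placeholder names
gen_name, enc_name = pick_models(generators, encoders, rng)

W = [[1.0, 0.0], [0.0, 1.0]]
W_query1 = otu_perturb(W, 5e-4, rng)   # used for one query, then discarded
W_query2 = otu_perturb(W, 5e-4, rng)   # independent noise for the next query
z1 = encode(W_query1, [0.9, 0.1])
z2 = encode(W_query2, [0.9, 0.1])
```

The same input yields slightly different embeddings `z1` and `z2`, so an attacker who differentiates through one realization of the encoder is not optimizing against the one the defender will actually use.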

Quantitative Evaluation

We evaluate MirrorCheck against classical unimodal defenses (FeatureSqueeze, MagNet, PuVAE, DiffPure) and VLM-specific multimodal baselines (CIDER, Naive, CLIP, JailGuard, SmoothVLM, DPS) across multiple VLM architectures and classification tasks.

1. Adversarial Detection Performance (AUROC)

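Detection performance (AUROC) under text-targeted (AttackVLM-T, shown as VLM-T) and query-targeted (AttackVLM-Q, shown as VLM-Q) VLM attacks, as well as multimodal fusion attacks (Attack-MMFM, shown as MMFM). FS denotes FeatureSqueeze and MC denotes MirrorCheck. Higher is better.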

Victim Model	Attack Setting	Unimodal Baselines				Multimodal Baselines					Ours
Victim Model	Attack Setting	FS	MagNet	PuVAE	DiffPure	CIDER	Naive	CLIP	JailGuard	SmoothVLM	MC	Stochastic-MC
UniDiffuser	VLM-T	0.56	0.74	0.51	0.80	0.84	0.68	0.59	0.81	0.82	0.96	0.95
UniDiffuser	VLM-Q	0.65	0.85	0.70	0.81	0.80	0.65	0.57	0.83	0.83	0.98	0.98
BLIP	VLM-T	0.52	0.60	0.50	0.71	0.81	0.66	0.61	0.79	0.77	0.90	0.93
BLIP	VLM-Q	0.57	0.65	0.80	0.76	0.85	0.64	0.55	0.84	0.81	0.89	0.97
BLIP-2	VLM-T	0.61	0.73	0.52	0.80	0.84	0.70	0.62	0.82	0.80	0.93	0.94
BLIP-2	VLM-Q	0.61	0.85	0.72	0.83	0.77	0.67	0.58	0.80	0.78	0.92	0.99
	Bard	—	—	—	0.79	0.87	0.65	0.58	0.89	0.87	0.98	0.95
LLaVA	MMFM	—	—	—	0.67	0.83	0.62	0.52	0.85	0.85	0.82	0.85
OpenFlamingo	MMFM	—	—	—	0.65	0.84	0.60	0.51	0.87	0.84	0.81	0.81
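
For readers unfamiliar with the metric, AUROC for a score-based detector equals the probability that a randomly chosen clean input receives a higher consistency score than a randomly chosen adversarial one, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The score lists are made-up numbers for illustration, not results from the paper, and the paper's exact evaluation protocol is not described here.

```python
def auroc(clean_scores, adv_scores):
    """Probability that a clean score exceeds an adversarial score.

    Ties contribute 0.5. This pairwise form is equivalent to the area
    under the ROC curve when low scores are flagged as adversarial.
    """
    if not clean_scores or not adv_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for c in clean_scores:
        for a in adv_scores:
            if c > a:
                wins += 1.0
            elif c == a:
                wins += 0.5
    return wins / (len(clean_scores) * len(adv_scores))

# Made-up similarity scores, for illustration only.
clean = [0.72, 0.70, 0.66, 0.75]
adv = [0.50, 0.38, 0.68, 0.52]
value = auroc(clean, adv)   # 15 of the 16 clean/adversarial pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is the scale on which the table entries should be read.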

2. Semantic Similarity Scores (Clean vs. Attack)

We compare the cosine similarity score under clean and adversarial conditions. Clean images keep high semantic consistency after regeneration (0.675 to 0.729 across the settings below), while under attack the similarity between the input and the regenerated image drops to between 0.380 and 0.517.

Victim Model	Task Setting	Clean Similarity ↑	Attack Similarity (VLM-Q) ↓	Semantic Gap
UniDiffuser	Image Captioning	0.721	0.498	0.223
BLIP	Image Captioning	0.707	0.508	0.199
BLIP-2	Image Captioning	0.729	0.380	0.349
Img2Prompt	Visual Question Answering	0.675	0.517	0.158
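
The Semantic Gap column is the clean similarity minus the attack similarity. The short check below recomputes it from the values in the table; the function name `semantic_gap` is just a label chosen for this sketch.

```python
# (victim model, clean similarity, attack similarity, reported gap) from the table above.
rows = [
    ("UniDiffuser", 0.721, 0.498, 0.223),
    ("BLIP", 0.707, 0.508, 0.199),
    ("BLIP-2", 0.729, 0.380, 0.349),
    ("Img2Prompt", 0.675, 0.517, 0.158),
]

def semantic_gap(clean, attack):
    """Difference between clean and adversarial consistency scores."""
    return clean - attack

# Round to three decimals to match the precision used in the table.
gaps = {name: round(semantic_gap(clean, attack), 3) for name, clean, attack, _ in rows}
largest = max(gaps, key=gaps.get)
```

A larger gap means clean and adversarial inputs are easier to separate with a single threshold on the consistency score.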

Qualitative Results

To demonstrate the visual consistency captured by MirrorCheck, we evaluate samples generated by Stable Diffusion when conditioned on the VLM's caption features. Under clean settings, synthesized images preserve the high-level semantic layout and context. Under adversarial attack, the VLM generates mismatched descriptions, leading to regenerated reference images that diverge drastically from the original features.

[Image panels: Reconstruction 1, Reconstruction 2, Reconstruction 3, Reconstruction 4]

Figure: Images generated by Stable Diffusion conditioned on the latent embeddings of UniDiffuser captions. The adapter successfully aligns caption-based representations with image generation pipelines, demonstrating visually coherent semantic preservation.

Robustness & Complexity

Adaptive Attacks (BPDA+EoT): We stress-test our defense by modeling a white-box adversary who has full access to the victim VLM, the alignment adapter, and the detection pipeline. Under Backward Pass Differentiable Approximation (BPDA) combined with Expectation over Transformation (EoT) optimization, Stochastic MirrorCheck continues to detect the attack. Specifically, while a single-encoder configuration remains slightly susceptible, an ensemble of 7 or more encoders combined with One-Time-Use parameter noise (scale $\ge 5\times10^{-4}$) achieves a detection accuracy of over 98%, defeating the adaptive optimization loop.
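
One plausible way to write the adaptive objective, offered as a hedged sketch rather than the authors' exact formulation: the attacker minimizes a task loss on the victim's output while maximizing the expected consistency score over the defender's randomness. The task loss $\mathcal{L}_{\text{atk}}$, the trade-off weight $\lambda$, the budget $\epsilon$, the norm index $q \in \{2, \infty\}$, and the encoder index $k$ sampled from the zoo are all symbols introduced here for illustration.

$$\min_{\|\delta\|_q \le \epsilon} \; \mathcal{L}_{\text{atk}}\big(t_{\text{adv}}\big) \; - \; \lambda \, \mathbb{E}_{k,\, \eta}\Big[\cos\big(\mathcal{I}_{\phi_k + \eta}(x_{\text{adv}}),\; \mathcal{I}_{\phi_k + \eta}(G_{\psi}(t_{\text{adv}}))\big)\Big], \qquad x_{\text{adv}} = x_{\text{clean}} + \delta, \quad t_{\text{adv}} = \mathcal{F}_\theta(x_{\text{adv}}; p)$$

In this reading, EoT approximates the expectation with a sample average over draws of $k$ and $\eta$, and BPDA substitutes a differentiable surrogate for steps that do not pass gradients, such as the discrete caption. Because $\eta$ is drawn fresh for each query and then discarded, the attacker can only optimize this average and never the specific realization the defender will use.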

Computational Complexity: Running on a single NVIDIA Quadro RTX A6000 GPU, the defense pipeline processes an image in about 15 seconds (victim model captioning: 0.2s, T2I generation: 5s, similarity calculation: 10s). By reducing the T2I diffusion timesteps from 50 to 10, the total inference time drops to just 1.2 seconds with only a marginal reduction in detection AUROC, showing a practical path for low-latency production serving.
