
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Authors
Samar Fares*, Klea Ziu*, Toluwani Aremu*, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar
Affiliations
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), NVIDIA, EPFL, Michigan State University
* Equal contribution
TL;DR: Vision-Language Models (VLMs) are highly vulnerable to adversarial image perturbations. We present MirrorCheck, a model-agnostic defense that uses Text-to-Image (T2I) models to regenerate visual content from the VLM's captions and measures semantic consistency between the input and the regenerated image in feature space. To thwart adaptive attacks, we introduce stochastic encoder/generator selection and One-Time-Use (OTU) parameter perturbations, maintaining detection rates up to 99%.
Abstract
Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings.
MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
Threat Model
Our framework is designed to detect adversarial attacks, irrespective of the attacker's level of knowledge. In this scenario, there are two parties:
- Attacker: The attacker's goal is to generate an adversarial image that causes the victim model to produce an incorrect caption or classification. The attack can be targeted, where the generated text matches a predefined adversarial target, or untargeted, where the model is simply forced to misinterpret or misdescribe the input image. In both cases, the perturbation is constrained to a norm-bounded adversarial budget. The adversary may have full white-box access or operate in a black-box setting.
- Defender: The defender aims to correctly classify input images as either clean or adversarial by assessing the consistency between the model's interpretation of the input and a reference image generated from the model's textual output. The defender assumes only black-box access to the victim model and has no access to any ground-truth clean reference image.
Methodology
Let $F$ be the victim model (a VLM or a classification model) that produces a description text $t$ from an input image $x$ under a prompt $p$. Let $E$ be a pretrained image encoder and let $G$ denote a pretrained text-conditioned image generation (T2I) model.
The defender first obtains the output caption $t = F(x, p)$, then synthesizes the reference image $\hat{x} = G(t)$. Semantic consistency is assessed by computing the cosine similarity between the feature embeddings:
$$s(x, \hat{x}) = \frac{E(x)^\top E(\hat{x})}{\lVert E(x) \rVert \, \lVert E(\hat{x}) \rVert}$$
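The caption, regenerate, and compare steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy "left/right" images, the stand-in victim, generator, and encoder, and the `THRESHOLD` value are all hypothetical, chosen only so the example runs without any model weights.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def mirrorcheck_score(image, prompt, victim, generator, encoder):
    """Caption the input, regenerate a reference image, compare embeddings."""
    caption = victim(image, prompt)      # black-box query to the victim model
    reference = generator(caption)       # T2I regeneration from the caption
    return cosine_similarity(encoder(image), encoder(reference))

# Toy stand-ins (assumptions for illustration only).
LEFT = np.zeros((4, 4)); LEFT[:, :2] = 1.0     # "image" with a bright left half
RIGHT = np.zeros((4, 4)); RIGHT[:, 2:] = 1.0   # "image" with a bright right half

def toy_victim(image, prompt):
    return "left" if image[:, :2].sum() >= image[:, 2:].sum() else "right"

def fooled_victim(image, prompt):
    # Mimics a successful targeted attack: the caption no longer matches the image.
    return "right"

def toy_generator(caption):
    return LEFT if caption == "left" else RIGHT

def toy_encoder(image):
    return image.ravel()

THRESHOLD = 0.5  # hypothetical decision threshold, not a value from the paper

clean_score = mirrorcheck_score(LEFT, "describe", toy_victim, toy_generator, toy_encoder)
attacked_score = mirrorcheck_score(LEFT, "describe", fooled_victim, toy_generator, toy_encoder)
print("clean flagged:", clean_score < THRESHOLD)
print("attacked flagged:", attacked_score < THRESHOLD)
```

In a real deployment the stand-ins would be replaced by the victim VLM, a T2I model, and a pretrained image encoder, and the threshold would be calibrated on held-out clean data.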
Stochastic MirrorCheck & OTU Noise: To defend against adaptive white-box attackers who optimize perturbations to fool both the VLM and the detector, we introduce:
- Stochastic Model Zoo selection: Randomly selecting T2I generators (Stable Diffusion, UniDiffuser, ControlNet) and image encoders (CLIP variants, OpenCLIP) per query.
- One-Time-Use (OTU) Perturbations: Injecting fresh Gaussian noise, controlled by a scaling factor, directly into the image encoder's parameter weights before each query. This acts as a random projection that disrupts gradient-based optimization attacks.
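The two mechanisms above can be illustrated with a toy sketch. It assumes linear encoders (plain weight matrices), additive noise of the form `scale * N(0, 1)`, and generators that ignore the caption; the names `otu_perturb`, `stochastic_query`, and `make_toy_generator` are hypothetical, and the actual noise parameterization in MirrorCheck may differ.

```python
import numpy as np

def otu_perturb(weights, scale, rng):
    """Return a freshly perturbed copy of the weights; the stored weights stay untouched."""
    return weights + scale * rng.standard_normal(weights.shape)

def stochastic_query(x, caption, encoder_zoo, generator_zoo, scale, rng):
    """Score one query with a randomly drawn generator and a one-time noisy encoder."""
    generator = generator_zoo[rng.integers(len(generator_zoo))]
    weights = encoder_zoo[rng.integers(len(encoder_zoo))]
    noisy = otu_perturb(weights, scale, rng)  # used for this query only, then discarded
    a, b = noisy @ x, noisy @ generator(caption)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)  # toy "image" as a flat vector

def make_toy_generator(offset):
    # Stand-in T2I model: ignores the caption and returns a near-copy of x.
    return lambda caption: x + offset

encoder_zoo = [rng.standard_normal((8, 16)) for _ in range(3)]     # toy linear encoders
generator_zoo = [make_toy_generator(o) for o in (0.05, 0.10, 0.15)]

s1 = stochastic_query(x, "a caption", encoder_zoo, generator_zoo, scale=0.1, rng=rng)
s2 = stochastic_query(x, "a caption", encoder_zoo, generator_zoo, scale=0.1, rng=rng)
print(s1, s2)  # the same input yields different scores on repeated queries
```

Because the noisy weights are drawn anew and discarded after each query, an attacker who differentiates through one draw is optimizing against an encoder that the defender never uses again.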
Quantitative Evaluation
We evaluate MirrorCheck against classical unimodal defenses (FeatureSqueeze, MagNet, PuVAE, DiffPure) and VLM-specific multimodal baselines (CIDER, Naive, CLIP, JailGuard, SmoothVLM, DPS) across multiple VLM architectures and classification tasks.
1. Adversarial Detection Performance (AUROC)
Overall detection rate (AUROC) under text-targeted (AttackVLM-T) and query-targeted (AttackVLM-Q) VLM attacks, as well as multimodal fusion attacks (Attack-MMFM). Higher is better.
Column groups: FS, MagNet, PuVAE, and DiffPure are unimodal baselines; CIDER, Naive, CLIP, JailGuard, and SmoothVLM are multimodal baselines; MC and Stochastic-MC are ours.
| Victim Model | Attack Setting | FS | MagNet | PuVAE | DiffPure | CIDER | Naive | CLIP | JailGuard | SmoothVLM | MC | Stochastic-MC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniDiffuser | VLM-T | 0.56 | 0.74 | 0.51 | 0.80 | 0.84 | 0.68 | 0.59 | 0.81 | 0.82 | 0.96 | 0.95 |
| UniDiffuser | VLM-Q | 0.65 | 0.85 | 0.70 | 0.81 | 0.80 | 0.65 | 0.57 | 0.83 | 0.83 | 0.98 | 0.98 |
| BLIP | VLM-T | 0.52 | 0.60 | 0.50 | 0.71 | 0.81 | 0.66 | 0.61 | 0.79 | 0.77 | 0.90 | 0.93 |
| BLIP | VLM-Q | 0.57 | 0.65 | 0.80 | 0.76 | 0.85 | 0.64 | 0.55 | 0.84 | 0.81 | 0.89 | 0.97 |
| BLIP-2 | VLM-T | 0.61 | 0.73 | 0.52 | 0.80 | 0.84 | 0.70 | 0.62 | 0.82 | 0.80 | 0.93 | 0.94 |
| BLIP-2 | VLM-Q | 0.61 | 0.85 | 0.72 | 0.83 | 0.77 | 0.67 | 0.58 | 0.80 | 0.78 | 0.92 | 0.99 |
| Bard | | — | — | — | 0.79 | 0.87 | 0.65 | 0.58 | 0.89 | 0.87 | 0.98 | 0.95 |
| LLaVA | MMFM | — | — | — | 0.67 | 0.83 | 0.62 | 0.52 | 0.85 | 0.85 | 0.82 | 0.85 |
| OpenFlamingo | MMFM | — | — | — | 0.65 | 0.84 | 0.60 | 0.51 | 0.87 | 0.84 | 0.81 | 0.81 |
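For readers unfamiliar with the metric, AUROC for a similarity-based detector can be computed directly from two sets of scores: it is the probability that a randomly chosen clean input receives a higher similarity than a randomly chosen adversarial one, with ties counted as half. The sketch below uses made-up scores purely for illustration; they are not results from the paper, and the function name `auroc` is our own.

```python
import numpy as np

def auroc(clean_scores, adv_scores):
    """Probability that a clean sample scores higher than an adversarial one (ties count 0.5)."""
    clean = np.asarray(clean_scores, dtype=float)
    adv = np.asarray(adv_scores, dtype=float)
    diff = clean[:, None] - adv[None, :]   # all clean-vs-adversarial pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Illustrative similarity scores (invented for this example).
clean_scores = [0.72, 0.70, 0.75, 0.49]
adv_scores = [0.50, 0.38, 0.52, 0.71]

print(auroc(clean_scores, adv_scores))
```

A value of 1.0 means every clean input scores above every adversarial input; 0.5 corresponds to chance.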
2. Semantic Similarity Scores (Clean vs. Attack)
We compare cosine similarity scores under clean and adversarial conditions and report the gap between them. Clean images maintain high semantic consistency under regeneration (roughly 0.68 to 0.73 in the table below), while under attack the feature-space similarity between the input and the regenerated image drops sharply.
| Victim Model | Task Setting | Clean Similarity ↑ | Attack Similarity (VLM-Q) ↓ | Semantic Gap |
|---|---|---|---|---|
| UniDiffuser | Image Captioning | 0.721 | 0.498 | 0.223 |
| BLIP | Image Captioning | 0.707 | 0.508 | 0.199 |
| BLIP-2 | Image Captioning | 0.729 | 0.380 | 0.349 |
| Img2Prompt | Visual Question Answering | 0.675 | 0.517 | 0.158 |
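The Semantic Gap column is simply the clean similarity minus the attack similarity. As a quick check, the following snippet recomputes it from the values in the table above; the variable names are ours.

```python
# (clean similarity, attack similarity under VLM-Q), copied from the table above.
similarities = {
    "UniDiffuser": (0.721, 0.498),
    "BLIP": (0.707, 0.508),
    "BLIP-2": (0.729, 0.380),
    "Img2Prompt": (0.675, 0.517),
}

# Semantic gap = clean similarity - attack similarity, rounded to three decimals.
gaps = {model: round(clean - attack, 3) for model, (clean, attack) in similarities.items()}
widest = max(gaps, key=gaps.get)  # model with the largest clean-vs-attack separation

for model, gap in gaps.items():
    print(f"{model}: {gap:.3f}")
print("widest gap:", widest)
```

A wider gap leaves more room to place a decision threshold between the clean and adversarial score distributions.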
Qualitative Results
To demonstrate the visual consistency captured by MirrorCheck, we evaluate samples generated by Stable Diffusion when conditioned on the VLM's caption features. Under clean settings, synthesized images preserve the high-level semantic layout and context. Under adversarial attack, the VLM generates mismatched descriptions, leading to regenerated reference images that diverge drastically from the original features.




Figure: Images generated by Stable Diffusion conditioned on the latent embeddings of UniDiffuser captions. The adapter successfully aligns caption-based representations with image generation pipelines, demonstrating visually coherent semantic preservation.
Robustness & Complexity
Adaptive Attacks (BPDA+EoT): We stress-test our defense by modeling a white-box adversary who has full access to the victim VLM, the alignment adapter, and the detection pipeline. Under Backward Pass Differentiable Approximation (BPDA) and Expectation over Transformation (EoT) optimization, Stochastic MirrorCheck maintains high defense integrity. Specifically, while a single encoder remains somewhat susceptible, an ensemble of 7 or more encoders combined with scaled One-Time-Use parameter noise achieves a detection accuracy of over 98%, defeating the adaptive optimization loop.
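One way to write the adaptive attacker's objective is sketched below. The notation is assumed for illustration and is not taken from the source: $F$ is the victim model, $p$ the prompt, $G$ a T2I generator and $E_{\theta}$ an image encoder with weights $\theta$ (both drawn from the model zoo), $\epsilon \sim \mathcal{N}(0, I)$ the OTU noise with scale $\sigma$, and $\delta$ the perturbation with budget $\rho$. An EoT attacker would try to maximize the expected similarity, so that the adversarial input looks clean to the detector, while still forcing the adversarial output:

$$
\max_{\lVert \delta \rVert \le \rho} \; \mathbb{E}_{(E_\theta, G),\, \epsilon} \Big[ \cos\big( E_{\theta + \sigma\epsilon}(x + \delta),\; E_{\theta + \sigma\epsilon}\big(G(F(x + \delta, p))\big) \big) \Big] \quad \text{s.t. } F(x + \delta, p) \text{ remains the adversarial output}
$$

BPDA enters because captioning and regeneration are not differentiable end to end, so the attacker substitutes a differentiable surrogate for that path in the backward pass. Since the defender draws a fresh $(E_\theta, G)$ and a fresh $\epsilon$ at every query, the attacker can only optimize against the average, never against the specific draw used at detection time, which is consistent with the robustness reported above.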
Computational Complexity: Running on a single NVIDIA Quadro RTX A6000 GPU, the defense pipeline processes an image in 15 seconds (victim model captioning: 0.2s, T2I generation: 5s, similarity calculation: 10s). By reducing the T2I diffusion timesteps from 50 to 10, the total inference time drops to just 1.2 seconds with only a marginal reduction in detection AUROC, showing a practical path for low-latency production serving.