Detect Before You Leap: Mirage Detection in Vision–Language Models

Sayeed Shafayet Chowdhury
Indiana University Indianapolis
saychow@iu.edu
(Lead contributor; Chowdhury also served as the faculty advisor for this work.)

Md. Shaown Miah
Bangladesh University of Engineering and Technology
1918018@bme.buet.ac.bd

S. M. Taiabul Haque
BRAC University
taiabul.haque@bracu.ac.bd

Syed Ishtiaque Ahmed
University of Toronto
ishtiaque@cs.toronto.edu

Abstract

Vision–language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, recently described as mirage Asadi et al. (2026), is especially concerning in medical and document VQA, where a plausible but visually ungrounded answer may be mistaken for image-based evidence. We study the complementary problem of pre-release mirage detection: given an image–question pair, determine whether the VLM should answer or abstain before generation. To that end, we propose a novel model-agnostic Text-Conditioned Layer-wise Internal Alignment (TC-LIA) method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. The key idea is to project layer-wise image patch tokens into the final CLIP embedding space and measure their similarity with the question embedding, thereby tracking whether question-relevant visual evidence emerges across vision layers. TC-LIA summarizes this alignment trajectory using final image–text cosine similarity, late-layer top-$k$ patch–text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic-based blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains with related, unrelated-real, and blank/noise inputs, and across twelve VLM backbones, Qwen2.5-VL-32B achieves the highest three-class detection accuracy of 94.7% with a 3.0% mirage rate, while Qwen2.5-VL-72B achieves 94.6% accuracy with a lower 2.8% mirage rate. Baseline mirage rates span 21.7–66.6%.

1 Introduction

Multimodal models are increasingly used to answer questions about images, radiographs, pathology slides, documents, and scene photographs. Their deployment implicitly assumes that when an image is supplied with a question, the resulting answer is grounded in that image. Recent work on mirage reasoning challenges this assumption: VLMs can generate detailed visual descriptions and confident diagnoses even when images are absent, blank, or irrelevant Asadi et al. (2026). Such failures are particularly dangerous in high-stakes settings, where a system may transform a missing or mismatched input into a fluent but unsupported answer.

The central limitation of a purely generative VLM interface is the absence of an explicit pre-release test of visual answerability. A VLM may answer a question about an aortic aneurysm from a natural image, a blank image, or an unrelated document because its language prior is strong enough to produce a plausible response. Therefore, answer accuracy alone is insufficient: a safe system must also detect when it should not answer.

We frame this as runtime mirage detection. Given an image $x$ and question $q$, the detector predicts whether the pair is Related, Unrelated-Real, or Blank/Noise. If the pair is related, the VLM is allowed to answer; otherwise, the system abstains with a refusal such as “I cannot answer based on the given image.” This setting differs from ordinary hallucination detection: instead of checking whether a completed answer is false, we decide a priori whether the visual input is appropriate for the question.

To that end, we propose TC-LIA, a Text-Conditioned Layer-wise Internal Alignment method built on CLIP-style image–text representations and ViT patch tokens Radford et al. (2021); Dosovitskiy et al. (2021). Rather than relying only on final global CLIP cosine similarity, TC-LIA examines how the question-conditioned patch evidence evolves across vision-encoder layers. For each layer, patch tokens are projected into the CLIP embedding space and compared with the question embedding. We summarize late-layer top-$k$ patch alignment, early-to-late alignment gain, and layer-wise slope. The motivation is that related image–question pairs should develop localized and increasingly specific patch–text alignment in later layers, whereas unrelated or blank inputs should show weaker, flatter, or unstable alignment curves.

Our full system combines five stages: (1) pixel-statistic blank/noise detection, (2) zero-shot domain routing using CLIP prompt groups, (3) TC-LIA features, (4) structured VLM self-assessment, and (5) feature-level fusion using a boosting ensemble. We evaluate the proposed framework across medical, natural-image, document, and infographic VQA settings, as well as across multiple VLM families. In addition to the main results, we provide systematic diagnostic analyses in the appendix, including comparisons with final CLIP similarity, decoder-side RAPT attention ratios Liu et al. (2026), GradCAM-style saliency, and prompting-based self-assessment. These analyses clarify the limitations of simpler relevance signals and motivate the need for layer-wise text-conditioned alignment combined with ensemble fusion.

Scope and distinctions. Our work differs from generic hallucination detection. First, the detector is pre-release; it decides whether a VLM should be allowed to answer. Second, the method is not simply CLIP cosine similarity: final cosine is included as both a baseline and a feature, but TC-LIA additionally uses layer-wise local patch–text alignment, late top-$k$ evidence, early-to-late gain, and slope. Third, the method does not rely only on VLM self-refusal; the VLM’s structured class prediction is one feature among several and can be overridden by the ensemble. Fourth, blank/noise detection is not the main novelty; blank images are handled by a simple pixel gate. The central challenge is detecting Unrelated-Real images that are visually valid but semantically mismatched to the question.

Contributions.

• We formulate runtime mirage detection as a pre-release decision problem for identifying whether an image–question pair is Related, Unrelated-Real, or Blank/Noise before a VLM answers, evaluated across five VQA domains.
• We propose TC-LIA, a novel text-conditioned layer-wise internal alignment method that uses intermediate CLIP ViT-H/14 patch-token representations across all layers to quantify the gradual emergence of question-relevant visual evidence, distilled into 11 interpretable scalar features.
• The proposed multi-stage framework fuses pixel statistics, CLIP-based domain routing, TC-LIA features, and structured VLM self-assessment with a boosting ensemble, reducing base-prompt mirage rates of 21.7–66.6% down to 2.7–3.3% across models.

Figure 1: Proposed pipeline combining blank/noise detection, CLIP-based domain routing, TC-LIA image–text alignment, and structured VLM outputs. An ensemble fuses these to pass the VLM response only for Related image–question pairs, while rejecting Unrelated-Real and Blank/Noise inputs to prevent mirage reasoning.

2 Related Work

Grounding failures, hallucination, and VQA shortcuts.

Recent work has shown that VLMs can produce fluent answers that are weakly grounded in the supplied image, exposing a gap between answer plausibility and visual evidence use Li et al. (2023); Bai et al. (2024). POPE evaluates object hallucination by testing whether model outputs mention objects unsupported by the image Li et al. (2023), while broader surveys organize multimodal hallucination sources, benchmarks, and mitigation strategies Bai et al. (2024). Relatedly, VQA benchmarks have long revealed that models may exploit language priors and dataset artifacts rather than visual evidence Goyal et al. (2017); Agrawal et al. (2018). Winoground further shows that strong vision–language systems can struggle with grounded compositional reasoning despite strong aggregate performance Thrush et al. (2022). Our work differs from post-generation hallucination diagnosis: we ask whether the image–question pair should be answered at all before generation.

Abstention and selective prediction.

Selective prediction studies how models can trade coverage for reliability by abstaining on uncertain inputs Geifman and El-Yaniv (2017). Recent work also examines whether language models can recognize uncertainty or know when they know an answer Kadavath and others (2022); Xiong and others (2024). However, answer confidence is not equivalent to visual evidence consistency. In mirage detection, the key is not whether the model is uncertain, but whether the supplied image contains relevant evidence. We, therefore, formulate abstention as a pre-release image–question consistency decision.

Intermediate representations and image–text matching.

Our method builds on the observation that intermediate neural representations encode structured task-relevant information and can be probed to analyze model behavior Alain and Bengio (2016); Tenney et al. (2019); Belinkov (2022). In vision transformers and self-supervised models, semantic and spatial structure can emerge non-uniformly across layers Caron et al. (2021); Raghu et al. (2021). CLIP-style image–text similarity provides a natural baseline for relevance estimation Radford et al. (2021), but a single final global similarity score may be too coarse for localized VQA evidence, especially in medical, document, and infographic settings. TC-LIA instead tracks question-conditioned patch–text alignment across vision-encoder layers and uses this trajectory as a runtime mirage-detection signal.

Grounding, domain-aware verification, and datasets.

Grounded pretraining and open-set grounding models provide phrase-to-region localization capabilities Li et al. (2022); Liu et al. (2023), but they are not designed specifically to decide whether a VLM should abstain before answering. Our framework targets this pre-release decision using lightweight model-agnostic layer-wise alignment features and domain-routing signals. We evaluate across diverse VQA settings, including pathology, scene-text, document, and infographic domains He et al. (2020); Singh et al. (2019); Mathew et al. (2021, 2022).

RAPT and visual evidence augmentation.

Relative Attention Per Token (RAPT) and Visual Evidence Augmentation (VEA) analyze how VLM decoder layers allocate attention between image and text tokens, showing that models may attend to relevant visual regions even when final answers are wrong Liu et al. (2026). This is orthogonal to our setting: RAPT studies how available visual evidence is used during answering, whereas we ask whether sufficient question-relevant visual evidence is present before answering is allowed. In our experiments, RAPT-style attention ratios are useful diagnostics but are not reliable standalone mirage detectors, as shown in Appendix E.3.

3 Problem Formulation

Let $x\in\mathcal{X}$ denote an image and $q\in\mathcal{Q}$ denote a textual question. We assign a label $y\in\mathcal{Y}=\{R,U,B\}$ , where $R$ indicates that $x$ is related to $q$ and the VLM may answer, $U$ indicates that $x$ is a real but unrelated image, and $B$ indicates blank or noise input. A mirage detector $g(x,q)$ outputs one of these three labels. If $g(x,q)=R$ , a downstream VLM is allowed to answer; otherwise the system returns a refusal.

Mirage rate.

We define mirage rate as the fraction of all examples for which the detector incorrectly allows a non-related input to be answered:

$$\mathrm{MR}(g)=\frac{1}{|\mathcal{D}|}\sum_{(x,q,y)\in\mathcal{D}}\mathbf{1}\!\left[y\neq R,\,g(x,q)=R\right].\tag{1}$$

Here, $R$ denotes Related. Thus, $y\neq R$ corresponds to either Unrelated-Real or Blank/Noise inputs. This differs from conventional false-positive rate because the denominator is the entire evaluation set, matching the deployment risk that a random incoming request may be wrongly passed to the VLM.
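The difference between Eq. (1) and a conventional false-positive rate can be made concrete with a small sketch. The label strings and the toy predictions below are illustrative assumptions, not data from the paper; only the two denominators matter.

```python
from typing import Sequence

RELATED = "R"  # labels follow the R / U / B convention of the problem formulation


def mirage_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of ALL examples that are non-related but passed as Related."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    passed = sum(1 for y, g in zip(y_true, y_pred) if y != RELATED and g == RELATED)
    return passed / len(y_true)


def false_positive_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Same numerator, but divided by the non-related examples only."""
    negatives = [g for y, g in zip(y_true, y_pred) if y != RELATED]
    if not negatives:
        raise ValueError("no non-related examples")
    return sum(1 for g in negatives if g == RELATED) / len(negatives)


# Toy example (hypothetical): one Unrelated-Real input is wrongly passed.
y_true = ["R", "R", "U", "U", "B", "B"]
y_pred = ["R", "U", "R", "U", "B", "B"]
print(mirage_rate(y_true, y_pred))          # 1 of 6 examples
print(false_positive_rate(y_true, y_pred))  # 1 of 4 non-related examples
```

With a balanced three-condition set, the mirage rate is therefore at most two thirds, while the false-positive rate can reach one.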

4 Method

The proposed system consists of five stages, as depicted in Fig. 1: blank/noise screening, domain-adaptive CLIP routing, TC-LIA feature extraction, structured VLM self-assessment, and ensemble-based feature fusion.

Stage 1: Blank and noise detection. Blank and noise inputs are detected using image statistics, including the global standard deviation of grayscale intensities, patch-variance coefficient of variation, and spectral flatness. This stage serves as a high-recall safeguard for non-informative inputs, allowing subsequent stages to focus on the more challenging semantic distinction between Related and Unrelated-Real image–question pairs.
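As a rough illustration of Stage 1, the sketch below computes two of the three named statistics (global grayscale standard deviation and the coefficient of variation of patch variances) using only the Python standard library. Spectral flatness is omitted for brevity, and the thresholds `std_floor` and `cv_floor`, the patch size, and the `screen` function are hypothetical choices, not the values or interface used in the paper.

```python
import random
import statistics


def patch_variances(img, patch=4):
    """Variance of each non-overlapping patch x patch tile of a 2-D grayscale image."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            tile = [img[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            out.append(statistics.pvariance(tile))
    return out


def pixel_stats(img, patch=4):
    """Return (global_std, patch_variance_cv) for pixel values in [0, 1]."""
    flat = [v for row in img for v in row]
    global_std = statistics.pstdev(flat)
    pv = patch_variances(img, patch)
    mean_pv = statistics.fmean(pv)
    cv = statistics.pstdev(pv) / mean_pv if mean_pv > 0 else 0.0
    return global_std, cv


def screen(img, std_floor=0.01, cv_floor=0.5):
    """Hypothetical gate; thresholds are illustrative, not the paper's."""
    std, cv = pixel_stats(img)
    if std < std_floor:
        return "blank"   # almost no intensity variation anywhere
    if cv < cv_floor:
        return "noise"   # every patch has roughly the same variance
    return "pass"        # variance is spatially structured


rng = random.Random(0)
blank = [[0.5] * 16 for _ in range(16)]
noise = [[rng.random() for _ in range(16)] for _ in range(16)]
# Structured image: flat background with one textured quadrant.
scene = [[(rng.random() if i < 8 and j < 8 else 0.2) for j in range(16)] for i in range(16)]
print(screen(blank), screen(noise), screen(scene))
```

The intuition is that a blank image has near-zero global variation, uniform noise has similar variance in every patch, and a real image concentrates variance in some regions more than others.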

Stage 2: Domain-adaptive CLIP routing. A single CLIP embedding space may not provide uniformly reliable image–text relevance estimates across heterogeneous visual domains Radford et al. (2021); Zhang et al. (2023). Rather than performing fine-grained domain assignment, we use a coarse distinction between medical and natural images. Specifically, we compute zero-shot prompt probabilities over separate medical and natural prompt sets. Let $p_{m}$ and $p_{n}$ denote the normalized probabilities assigned to the medical and natural groups, respectively. These probability features guide embedding-space routing and are also used as inputs to the final ensemble classifier.
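One way to obtain $p_{m}$ and $p_{n}$ is to apply a softmax over image-to-prompt similarities and sum the probability mass within each prompt group. The sketch below assumes this grouping scheme; the similarity values and the logit scale of 100 are illustrative assumptions, and the paper's actual prompts and routing rule may differ.

```python
import math


def group_probabilities(sims, groups, scale=100.0):
    """Softmax over per-prompt similarities, then sum probability mass per group."""
    logits = [scale * s for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = {}
    for e, g in zip(exps, groups):
        probs[g] = probs.get(g, 0.0) + e / z
    return probs


# Hypothetical image-to-prompt cosine similarities (not from the paper).
sims = [0.27, 0.25, 0.18, 0.17]
groups = ["medical", "medical", "natural", "natural"]
p = group_probabilities(sims, groups)
p_m, p_n = p["medical"], p["natural"]
route = "medical" if p_m >= p_n else "natural"
print(route, round(p_m, 4), round(p_n, 4))
```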

Stage 3: TC-LIA. Let $z_{\ell,0:N}=f_{\ell}(x)$ denote the token representations extracted from layer $\ell$ of the CLIP ViT vision encoder, where $z_{\ell,0}$ is the CLS token and $z_{\ell,1:N}$ are patch tokens. Let $t(q)$ be the normalized CLIP text embedding of question $q$ . To compare intermediate vision tokens with the text embedding, each token is mapped into the CLIP joint embedding space using the fixed CLIP visual post-normalization and projection:

$$\begin{aligned}P(z_{\ell,i})&=\mathrm{norm}\!\left(\mathrm{LN}_{\mathrm{post}}(z_{\ell,i})W_{v}\right),\\ s_{\ell,i}&=\cos\!\left(P(z_{\ell,i}),t(q)\right).\end{aligned}$$

Here, $W_{v}$ is the CLIP visual projection matrix and $\mathrm{norm}(\cdot)$ denotes $\ell_{2}$ normalization. The projection is fixed and is not learned during detector training.
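The readout above can be sketched in a few lines of plain Python. This is a toy illustration only: the layer norm here has no learned scale or shift, the dimensions are tiny, and `W_v`, `tokens`, and `t_q` are made-up values rather than real CLIP weights or embeddings.

```python
import math


def layer_norm(z, eps=1e-5):
    """LayerNorm without learned scale/shift (assumed identity for brevity)."""
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var + eps) for v in z]


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def project(z, W_v):
    """P(z) = norm(LN_post(z) W_v); W_v has shape (d_model, d_joint)."""
    h = layer_norm(z)
    out = [sum(h[i] * W_v[i][j] for i in range(len(h))) for j in range(len(W_v[0]))]
    return l2_normalize(out)


def patch_text_similarity(tokens, W_v, t_q):
    """s_{l,i} for one layer: cosine between each projected token and t(q)."""
    t = l2_normalize(t_q)
    return [sum(a * b for a, b in zip(project(z, W_v), t)) for z in tokens]


# Toy dimensions: d_model = 3, d_joint = 2 (a real encoder is far larger).
W_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
tokens = [[2.0, 0.0, -2.0], [0.0, 2.0, -2.0]]
t_q = [1.0, 0.0]
sims = patch_text_similarity(tokens, W_v, t_q)
print(sims)  # first token aligns with t_q, second is orthogonal to it
```

The same fixed `W_v` is applied at every layer, which is what allows the resulting similarities to be compared across depth.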

Why does a final-layer projection work on intermediate tokens?

$W_{v}$ was trained to map the final visual representation into the CLIP joint embedding space, not to optimally project arbitrary intermediate patch tokens. Applying it to intermediate-layer tokens should therefore be viewed as an approximate readout rather than a claim that those tokens are fully CLIP-aligned. Nevertheless, we empirically demonstrate the utility of this approximation in Fig. 3(a): deeper layers inhabit the same evolving residual feature space that ultimately feeds the CLIP readout, so question-relevant patch tokens progressively move toward semantically aligned directions even before the final layer. Crucially, TC-LIA does not require exact calibration of intermediate cosine values; it relies on relative trajectory statistics, namely late-layer top-$k$ alignment, early-to-late gain, and slope (Fig. 2), which remain discriminative even when the projection is approximate. In this sense, $W_{v}$ acts as a shared semantic ruler across layers: not a layer-optimal projector, but a consistent probe whose output trajectory separates related from unrelated image–question pairs. Empirical validation of this approximation, including layer-wise alignment calibration and CKA analysis confirming content-neutrality, is provided in Appendix B.3 and B.4.

For each layer, we exclude the CLS token and summarize local image evidence using the mean of the top-$k$ patch–text similarities:

$$a_{\ell}=\frac{1}{k}\sum_{i\in\mathrm{TopK}_{k}(s_{\ell,1:N})}s_{\ell,i},\qquad k=10.$$

Let $\mathcal{L}_{E}$ and $\mathcal{L}_{L}$ denote the first and second halves of the captured vision-encoder layers. We compute

$$a_{E}=\frac{1}{|\mathcal{L}_{E}|}\sum_{\ell\in\mathcal{L}_{E}}a_{\ell},\qquad a_{L}=\frac{1}{|\mathcal{L}_{L}|}\sum_{\ell\in\mathcal{L}_{L}}a_{\ell}.$$

The scalar TC-LIA features are

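$$\begin{aligned}\mathrm{late}&=a_{L},\qquad\mathrm{gain}=a_{L}-a_{E},\\ \mathrm{slope}&=\mathrm{LinearSlope}\big(\{a_{\ell}\}_{\ell\in\mathcal{L}_{E}\cup\mathcal{L}_{L}}\big),\end{aligned}$$

where the slope is fit to the per-layer values $a_{\ell}$ in layer order over all captured layers (not to the late-layer mean $a_{L}$).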

Further, we compute the standard final CLIP image–text similarity $\mathrm{final\_cos}=\cos(v(x),t(q))$ , where $v(x)$ is the normalized final CLIP image embedding. The composite Internal Alignment Score is,

$$\mathrm{IAS}=50\times\mathrm{final\_cos}+25\times\mathrm{late}+15\times\mathrm{gain}+10\times\mathrm{slope}.$$

The weights in the IAS formula are empirically validated and shown to be robust to perturbation; the appendix provides a sensitivity analysis of the IAS weights confirming that even learned weights closely match these fixed values. The final TC-LIA feature set consists of $\mathrm{final\_cos}$, $\mathrm{late}$, $\mathrm{gain}$, $\mathrm{slope}$, and $\mathrm{IAS}$, which are passed to the ensemble classifier. The overall TC-LIA workflow and its layer-wise alignment features are illustrated in Fig. 2. The theoretical motivation in Section 5 formalizes why late-layer alignment, early-to-late gain, and slope are expected to be informative: related pairs should exhibit emerging localized patch–text evidence in deeper layers, whereas unrelated or blank/noise inputs should lack a consistent late-layer alignment trajectory.
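The feature computation described in this stage can be sketched as follows. The sketch assumes that the slope is an ordinary least-squares fit against the layer index and that an odd number of layers puts the extra layer in the late half; neither detail is specified in the text, and the similarity values are made up.

```python
def topk_mean(sims, k=10):
    """a_l: mean of the k largest patch-text similarities in one layer."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)


def linear_slope(ys):
    """Least-squares slope of ys against layer index 0..n-1 (an assumption)."""
    n = len(ys)
    x_bar = (n - 1) / 2
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(ys))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def tc_lia_features(layer_sims, final_cos, k=10):
    """layer_sims[l] holds patch-text similarities for layer l (CLS excluded)."""
    if len(layer_sims) < 2:
        raise ValueError("need at least two layers")
    a = [topk_mean(s, k) for s in layer_sims]
    half = len(a) // 2
    early = sum(a[:half]) / half               # a_E: first half of layers
    late = sum(a[half:]) / (len(a) - half)     # a_L: second half of layers
    gain = late - early
    slope = linear_slope(a)
    ias = 50 * final_cos + 25 * late + 15 * gain + 10 * slope
    return {"final_cos": final_cos, "late": late, "gain": gain, "slope": slope, "IAS": ias}


# Hypothetical 4-layer, 3-patch example with k = 2.
layer_sims = [
    [0.10, 0.00, 0.05],
    [0.10, 0.10, 0.00],
    [0.20, 0.30, 0.10],
    [0.40, 0.30, 0.10],
]
feats = tc_lia_features(layer_sims, final_cos=0.30, k=2)
print(feats)
```

In this toy trajectory the per-layer top-$k$ means rise with depth, so gain and slope are both positive, which is the pattern the method expects for related pairs.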

Stage 4: Structured VLM self-assessment. The VLM receives a structured prompt (Fig. 8) requiring a class label and an answer. The predicted class is encoded as a feature. If the ensemble predicts Related, the answer is released; otherwise, it is replaced by a refusal response. This prevents over-reliance on the VLM’s self-refusal behavior.
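The release rule in this stage is simple enough to state as code. In this minimal sketch the function name `release` and the example answer string are hypothetical; the refusal text is the one quoted in the introduction.

```python
REFUSAL = "I cannot answer based on the given image."


def release(ensemble_label: str, vlm_answer: str) -> str:
    """Return the VLM answer only when the ensemble predicts Related ("R")."""
    return vlm_answer if ensemble_label == "R" else REFUSAL


# The VLM's own class guess is only an input feature; the ensemble label decides.
answer = "example answer produced by the VLM"
print(release("R", answer))
print(release("U", answer))
print(release("B", answer))
```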

Stage 5: Feature fusion. The final feature vector includes pixel statistics, domain-routing outputs, TC-LIA scalar features, and the VLM class encoding. We train XGBoost as the primary classifier Chen and Guestrin (2016) and compare it with LightGBM Ke et al. (2017), Gradient Boosting, AdaBoost, Random Forest Breiman (2001), rule-based fusion, CLIP-only, TC-LIA-only, and VLM-only baselines. The complete inference procedure is summarized in Algorithm 1, with implementation details provided in Appendix A.
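To make the fused input concrete, the sketch below assembles one plausible 11-dimensional feature vector: three pixel statistics, two routing probabilities, five TC-LIA scalars, and one VLM class code. This breakdown, the field names, the ordering, and the integer class encoding are all assumptions made for illustration; the authoritative feature list is the one in the paper's appendix. The classifier itself is not shown, since it would require a third-party library.

```python
from dataclasses import dataclass, astuple

VLM_CLASS_CODE = {"R": 0, "U": 1, "B": 2}  # hypothetical integer encoding


@dataclass
class FusionFeatures:
    # Stage 1: pixel statistics
    gray_std: float
    patch_var_cv: float
    spectral_flatness: float
    # Stage 2: domain-routing probabilities
    p_medical: float
    p_natural: float
    # Stage 3: TC-LIA scalars
    final_cos: float
    late: float
    gain: float
    slope: float
    ias: float
    # Stage 4: VLM self-assessed class
    vlm_class: int

    def to_vector(self):
        """Flatten to a list in declaration order for a tabular classifier."""
        return list(astuple(self))


# All values below are made up for illustration.
example = FusionFeatures(
    gray_std=0.21, patch_var_cv=1.4, spectral_flatness=0.12,
    p_medical=0.93, p_natural=0.07,
    final_cos=0.30, late=0.30, gain=0.21, slope=0.10, ias=26.7,
    vlm_class=VLM_CLASS_CODE["R"],
)
vector = example.to_vector()
print(len(vector), vector)
```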

5 Theoretical Motivation

The central assumption of the proposed TC-LIA method is that related pairs exhibit an increase in localized semantic alignment in later vision layers, whereas non-related pairs may show generic similarity or spurious attention but lack consistent late-layer evidence.

Lemma 1: late-layer alignment separation.

Let $a_{L}$ be the late-layer top-$k$ patch alignment. Suppose $a_{L}\mid R$ and $a_{L}\mid U$ are sub-Gaussian with means $\mu_{R}$ and $\mu_{U}$, common proxy variance $\sigma^{2}$, and margin $\Delta=\mu_{R}-\mu_{U}>0$. Then the threshold classifier $\hat{y}=R$ iff $a_{L}>(\mu_{R}+\mu_{U})/2$ has error at most $\exp(-\Delta^{2}/(8\sigma^{2}))$ for each class.

Proof sketch. Apply standard sub-Gaussian tail bounds to $P(a_{L}\leq\tau|R)$ and $P(a_{L}>\tau|U)$ with $\tau=(\mu_{R}+\mu_{U})/2$ .
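The tail-bound step can be written out explicitly. The derivation below assumes the standard one-sided sub-Gaussian bound $P(X-\mu\geq t)\leq\exp(-t^{2}/(2\sigma^{2}))$ (and its mirror image for the lower tail), and uses $\tau-\mu_{R}=-\Delta/2$ and $\tau-\mu_{U}=\Delta/2$.

```latex
\begin{aligned}
P(a_{L}\leq\tau\mid R)
  &= P\!\left(a_{L}-\mu_{R}\leq-\tfrac{\Delta}{2}\mid R\right)
   \leq \exp\!\left(-\frac{(\Delta/2)^{2}}{2\sigma^{2}}\right)
   = \exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right),\\
P(a_{L}>\tau\mid U)
  &= P\!\left(a_{L}-\mu_{U}>\tfrac{\Delta}{2}\mid U\right)
   \leq \exp\!\left(-\frac{(\Delta/2)^{2}}{2\sigma^{2}}\right)
   = \exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right).
\end{aligned}
```

Both class-conditional errors therefore decay exponentially in the squared margin-to-noise ratio $\Delta^{2}/\sigma^{2}$.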

Lemma 2: gain cancels layer-invariant shortcuts.

Assume $a_{\ell}(x,q)=c(x,q)+r_{\ell}(x,q)+\epsilon_{\ell}$ , where $c$ is a layer-invariant global image–text prior, $r_{\ell}$ is localized evidence that emerges in late layers, and $\epsilon_{\ell}$ is noise. Then $\mathrm{gain}=\mathrm{late}-\mathrm{early}$ cancels $c$ and estimates the emergence of localized evidence.
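The cancellation can be checked by substituting the decomposition into the definitions of $a_{E}$ and $a_{L}$. In the sketch below, the arguments $(x,q)$ are dropped, and $\bar{r}_{E},\bar{r}_{L},\bar{\epsilon}_{E},\bar{\epsilon}_{L}$ are shorthand (introduced here, not in the paper) for the means of $r_{\ell}$ and $\epsilon_{\ell}$ over the early and late layer sets.

```latex
\begin{aligned}
\mathrm{gain}=a_{L}-a_{E}
  &= \frac{1}{|\mathcal{L}_{L}|}\sum_{\ell\in\mathcal{L}_{L}}\bigl(c+r_{\ell}+\epsilon_{\ell}\bigr)
   - \frac{1}{|\mathcal{L}_{E}|}\sum_{\ell\in\mathcal{L}_{E}}\bigl(c+r_{\ell}+\epsilon_{\ell}\bigr)\\
  &= \underbrace{(c-c)}_{=\,0}
   + \bigl(\bar{r}_{L}-\bar{r}_{E}\bigr)
   + \bigl(\bar{\epsilon}_{L}-\bar{\epsilon}_{E}\bigr).
\end{aligned}
```

If the noise terms are additionally assumed to be zero-mean, the expected gain equals $\bar{r}_{L}-\bar{r}_{E}$, the growth of localized evidence from early to late layers, independent of the global prior $c$.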

Proposition 1: staged blank gating decomposes mirage risk.

Let $g_{B}$ be the blank/noise gate and $g_{N}$ be the non-blank related/unrelated detector. The total mirage risk satisfies

$$\mathrm{MR}(g)\leq P(B)\epsilon_{B}+P(U)\epsilon_{U},\tag{2}$$

where $\epsilon_{B}=P(g_{B}\neq B\mid B)$ and $\epsilon_{U}=P(g_{N}=R\mid U)$. This decomposition motivates the architecture in Fig. 1: a lightweight high-recall blank/noise stage reduces the first term, while TC-LIA and ensemble fusion target the harder semantic mismatch term. The theory therefore supports the design principle behind our system: mirage detection requires both low-level input validity checks and layer-wise semantic evidence alignment. Detailed proofs of the theoretical statements are provided in Appendix D.
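One way to see where the bound comes from is to split the mirage event by true class. The sketch below assumes a staged composition in which $g$ outputs $R$ only if the blank gate $g_{B}$ does not flag the input and the semantic detector $g_{N}$ then outputs $R$; the full argument is the one in the paper's appendix.

```latex
\begin{aligned}
\mathrm{MR}(g) &= P(y\neq R,\ g=R)
  = P(B)\,P(g=R\mid B)+P(U)\,P(g=R\mid U)\\
 &\leq P(B)\,P(g_{B}\neq B\mid B)+P(U)\,P(g_{N}=R\mid U)
  = P(B)\,\epsilon_{B}+P(U)\,\epsilon_{U}.
\end{aligned}
```

As a purely hypothetical numeric illustration, with balanced conditions $P(B)=P(U)=1/3$, $\epsilon_{B}=0.01$, and $\epsilon_{U}=0.08$, the bound evaluates to $(0.01+0.08)/3=0.03$, and almost all of it comes from the semantic mismatch term.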

6 Experiments

Datasets.

Following Liu et al. (2026), we evaluate on five domains: chest VQA, pathology VQA He et al. (2020), TextVQA Singh et al. (2019), DocVQA Mathew et al. (2021), and InfoVQA Mathew et al. (2022). Each base item is expanded into three conditions: related real image, unrelated real image, and blank/noise image. The training set for the ensemble contains 100 samples per domain per condition, while the remaining samples form the held-out test set. Dataset details are provided in Appendix A.1.

Models and Baselines.

We evaluate twelve open VLMs with complete five-domain coverage: Qwen2.5-VL-7B, 32B, and 72B Bai et al. (2025), BLIP2-OPT-2.7B, Gemma-3-4B-IT Gemma Team (2025), Phi-3.5-Vision, LLaVA-Next-110B, LLaVA-v1.6-34B Liu et al. (2024), InternVL3-38B Wang et al. (2025), MiniCPM-V-2.6, Aya-Vision-32B, and LLaMA-3.2-90B. These models span 2.7B–110B parameters across the LLaVA, LLaMA, InternVL, Qwen-VL, Gemma, and BLIP families.

Metrics.

A trivial detector could minimize mirage rate by refusing every input. Therefore, we report three-class accuracy, macro-F1, related recall, mirage rate, AUROC for binary related-versus-nonrelated detection, and answer quality of the eventual response using BLEU, ROUGE-L, and BERTScore F1 Papineni et al. (2002); Lin (2004); Zhang et al. (2020). Implementation details are provided in Appendix A.2. Our anonymized code is provided here: https://anonymous.4open.science/r/Mirage_Detection_in_VLMS-779D/.

7 Results

Layer-wise alignment separates related and non-related inputs.

Figure 3(a) visualizes the core empirical signal behind TC-LIA. Averaged across domains, Related image–question pairs develop a stronger late-layer top-$k$ patch–text alignment trajectory, whereas Unrelated-Real inputs remain comparatively flat and Blank/Noise inputs show unstable or non-semantic alignment. This pattern directly supports the use of late-layer alignment, early-to-late gain, and slope as features.

Main mirage detection performance.

Table 1 reports mirage detection performance across VLM families and detector variants. Across models, the base-prompt mirage rates span 21.7–66.6%, confirming that the VLMs often answer even when the image is unrelated or non-informative. In contrast, TC-LIA with ensemble fusion reduces the mirage rate to 2.7–3.3% across all models. The highest accuracy, 94.7%, is obtained with Qwen2.5-VL-32B (RandomForest), while Qwen2.5-VL-72B achieves a slightly lower mirage rate (2.8%) at 94.6% accuracy. Figure 4 visualizes the reduction from base prompt to ensemble across all twelve models. Additionally, Fig. 6 shows that all evaluated VLM backbones achieve low mirage rates after ensemble fusion, while accuracy varies across models. The best trade-off is obtained by Qwen2.5-VL-72B, which lies closest to the upper-left region with 94.6% accuracy and a 2.8% mirage rate.

Table 1: Main mirage detection results across twelve VLM backbones. MR denotes mirage rate. TC-LIA Only results are VLM-agnostic (fixed IAS threshold on the shared CLIP encoder), so Acc and MR are identical across backbones. TC-LIA + Ensemble reduces MR to 2.7–3.3% across all models.

VLM Backbone	Base MR $\downarrow$	TC-LIA Only Acc $\uparrow$	TC-LIA Only MR $\downarrow$	Ensemble (Ours) Acc $\uparrow$	Ensemble (Ours) MR $\downarrow$	Ensemble (Ours) Macro-F1 $\uparrow$
Qwen2.5-VL-32B	57.4%	90.6%	3.4%	94.7%	3.0%	0.947
Qwen2.5-VL-72B	63.6%	90.6%	3.4%	94.6%	2.8%	0.946
LLaMA-3.2-90B	26.0%	90.6%	3.4%	94.1%	2.7%	0.941
Aya-Vision-32B	21.7%	90.6%	3.4%	93.9%	3.2%	0.939
Qwen2.5-VL-7B	63.6%	90.6%	3.4%	93.7%	3.2%	0.937
Gemma3-4B	58.8%	90.6%	3.4%	92.8%	3.1%	0.928
LLaVA-v1.6-34B	24.9%	90.6%	3.4%	92.0%	3.3%	0.920
LLaVA-Next-110B	63.6%	90.6%	3.4%	92.0%	3.0%	0.920
InternVL3-38B	27.1%	90.6%	3.4%	91.4%	2.8%	0.914
MiniCPM-V-2.6	66.6%	90.6%	3.4%	91.1%	3.1%	0.910
Phi-3.5-Vision	61.9%	90.6%	3.4%	91.0%	3.0%	0.910
BLIP2-2.7B	58.0%	90.6%	3.4%	91.0%	3.0%	0.910

Why the five TC-LIA features?

Table 2 summarizes the individual relevance of the scalar alignment features. The composite Internal Alignment Score (IAS) improves over standard final CLIP cosine similarity, while slope and gain provide complementary information about whether evidence emerges across depth. These results justify using five features (final cosine, late top-$k$ alignment, gain, slope, and IAS) rather than a single final embedding similarity.

IAS provides class-separating evidence.

Figure 3(b) shows that Related image–question pairs are shifted toward higher IAS, indicating stronger question-conditioned visual evidence. In contrast, Unrelated-Real examples concentrate at lower IAS values, while Blank/Noise inputs form a narrower intermediate distribution. This separation supports IAS as a discriminative signal for identifying answerable inputs and rejecting mirage-prone cases. Detailed domain-wise results are in Appendix B.7.

Qualitative results.

Figure 5 illustrates the practical behavior of the proposed detector on matched and mismatched inputs using the same medical question. For the related CT image, the ensemble preserves the VLM answer because the visual evidence is consistent with the question. In contrast, when the same question is paired with an unrelated real image, the raw VLM still produces a plausible medical-style answer, but the ensemble correctly rejects the input and replaces the response with a refusal. This example highlights the central goal of mirage detection: preventing visually unsupported answers before they are released. Additional examples are provided in Appendix Figs. 27–34.

Ensemble feature importance.

Appendix Fig. 16 shows that IAS is the most important XGBoost feature, followed by VLM class encoding and final CLIP cosine similarity. This supports the role of TC-LIA as a complementary signal beyond both VLM self-assessment and standard global CLIP similarity.

Table 2: TC-LIA feature comparison for llava-onevision-qwen2-7b-si-hf. IAS produces the strongest composite scalar relevance score.

Score / feature	Interpretation	AUROC $\uparrow$
Late top-$k$ mean	late local evidence	0.822
Gain	early-to-late growth	0.876
Slope	layer-wise trend	0.882
IAS	weighted composite	0.938

Table 3: Ablation on Qwen2.5-VL-72B: removing the VLM class feature causes the largest accuracy drop.

Variant	Acc (%)	Macro-F1	Mirage Rate (%)	AUROC
No late top-$k$	94.1	0.941	3.1	0.983
No gain/slope	94.0	0.940	3.2	0.983
No VLM class feature	90.2	0.902	3.9	0.963
No Stage-1 blank gate	94.4	0.944	3.0	0.987
Full XGBoost (11 feat.)	94.6	0.946	2.8	0.986

Error structure.

Blank/noise inputs are detected with near-perfect recall in our approach. We therefore do not present blank detection as the main source of novelty. The dominant errors are related–unrelated confusions, indicating that the remaining challenge is semantic mismatch rather than low-level image corruption.

Generalization to the MIRAGE benchmark datasets.

Beyond our primary five-domain evaluation, we assess whether TC-LIA generalizes to the broader set of benchmarks introduced by Asadi et al. (2026). We construct a nine-domain evaluation incorporating VQA-RAD, PathVQA, TextVQA, DocVQA, InfographicVQA, MicroVQA, MedXpertQA, MMMU-Pro, and VideoMMMU, using out-of-domain images as the non-real condition and evaluating Qwen2.5-VL-7B. Under the base prompt, the model produces a mirage rate of 33.3%, meaning that one in three out-of-domain images elicits a hallucinated response. The TC-LIA ensemble reduces this to 0.26% with 90.1% three-class accuracy, demonstrating consistent generalization across diverse visual modalities spanning radiology, pathology, document understanding, microscopy, expert MCQ, and video question answering.

Answer quality on accepted related inputs.

Figure 7 summarizes answer quality averaged over all three conditions. TC-LIA + Ensemble substantially improves ROUGE-L and BERTScore across all models by producing well-formed refusals for Unrelated-Real and Blank/Noise inputs.

Ablation results.

Table 3 reports the contribution of each system component on Qwen2.5-VL-72B. Removing late top-$k$ or gain/slope costs roughly 0.5–0.6 percentage points of accuracy (94.1% and 94.0%, respectively, versus 94.6%). The largest single-feature drop comes from removing the VLM class encoding (90.2%, MR 3.9%), which shows that structured VLM self-assessment is the most complementary signal to TC-LIA. Removing Stage-1 blank gating has minimal impact on accuracy (94.4%) but slightly worsens mirage rate (3.0%), confirming its role as a precision guard rather than a recall booster. The full 11-feature XGBoost ensemble achieves 94.6% accuracy, 0.946 macro-F1, and 2.8% mirage rate. Ablation results in Appendix C show that the full ensemble generalizes across held-out domains and VLM backbones, while removing the structured VLM class feature or TC-LIA components increases mirage rate and reduces detection robustness. Per-domain ablations in Fig. 40 further show that the ensemble maintains strong accuracy across all five evaluation domains.

Qualitative grounding and negative probes.

In the appendix, we report several approaches that were considered and rejected as standalone solutions: toy CLIP relevance scoring (E.1), GradCAM/attention-only metrics (E.2), SAM3-style grounding (G), RAPT image/question attention ratios (E.3), and small-scale easy datasets where final cosine performed nearly perfectly (E.4). They support that mirage detection requires text-conditioned semantic evidence rather than attention concentration or output overlap alone.

8 Conclusion

We introduced TC-LIA, a text-conditioned layer-wise internal alignment method for detecting mirage-prone image–question pairs before VLM generation. By combining blank/noise detection, domain-adaptive CLIP routing, layer-wise patch–text alignment, structured VLM self-assessment, and XGBoost fusion, the proposed system substantially reduces mirage rate while preserving answer quality on related inputs. These results support a broader principle for safe multimodal deployment: VLMs should verify that question-relevant visual evidence is present before answering.

9 Limitations

The detector reduces mirage risk but does not certify that the downstream VLM answer is correct. Passing the detector means that the image appears relevant to the question, not that the generated answer is clinically or factually valid. The method requires access to intermediate ViT tokens, making it easier to implement for open CLIP-like encoders than closed models. Projecting intermediate-layer patch tokens with the fixed final CLIP readout is an approximation; these tokens were not directly trained to be layer-wise optimal CLIP embeddings, and TC-LIA relies on the resulting alignment trajectory as a diagnostic signal rather than treating projected intermediate features as exact CLIP-space representations. The fixed TC-LIA weights may be suboptimal for some domains. Finally, SAM3 grounding is currently used only for visualization; quantitative integration of segmentation-based grounding remains future work.

Ethics Statement

This work targets safer deployment of VLMs by reducing unsupported answers in medical and document VQA. The detector should not be used as a standalone clinical decision system. It is a pre-release safety layer intended to trigger abstention or human review when visual evidence is missing or mismatched. All medical data should be de-identified and used according to applicable licenses and institutional requirements.

References

A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4971–4980.
G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
M. Asadi, J. W. O’Sullivan, F. Cao, T. Nedaee, K. Rajabalifardi, F. Li, E. Adeli, and E. Ashley (2026) Mirage: the illusion of visual understanding. arXiv preprint arXiv:2603.21687.
J. Bai, S. Xie, Y. Li, Z. Chen, Y. Zhang, J. Wang, Y. Su, and X. Shen (2024) Hallucination of multimodal large language models: a survey. arXiv preprint arXiv:2404.18930.
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025) Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.
Y. Belinkov (2022) Probing classifiers: promises, shortcomings, and advances. Computational Linguistics 48 (1), pp. 207–219.
L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9650–9660.
T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
Gemma Team (2025) Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6904–6913.
X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie (2020) PathVQA: 30000+ questions for medical visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 646–647.
G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt (2021) OpenCLIP.
S. Kadavath et al. (2022) Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, Vol. 30.
J. Li, D. Li, S. Savarese, and S. C. H. Hoi (2022) Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10965–10975.
Y. Li, Y. Du, K. Kuang, W. X. Zhao, H. Xie, D. Yin, and J. Wen (2023) Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305.
C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
H. Liu, C. Li, Y. Li, and Y. J. Lee (2024) Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306.
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang (2023) Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
Z. Liu, Z. Chen, H. Liu, C. Luo, X. Tang, S. Wang, J. Zeng, Z. Dai, Z. Shi, T. Wei, H. Lu, B. Dumoulin, and H. Tong (2026) Seeing but not believing: probing the disconnect between visual attention and answer correctness in vlms. In International Conference on Learning Representations.
M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar (2022) InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706.
M. Mathew, D. Karatzas, and C. V. Jawahar (2021) DocVQA: a dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2200–2209.
K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12, pp. 2825–2830.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy (2021) Do vision transformers see like convolutional neural networks?. In Advances in Neural Information Processing Systems (NeurIPS).
A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019) Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326.
I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4593–4601.
T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, and C. Ross (2022) Winoground: probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5238–5248.
W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. (2025) InternVL3.5: advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
M. Xiong et al. (2024) Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in large language models. In International Conference on Learning Representations (ICLR).
S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, M. P. Lungren, T. Naumann, and H. Poon (2023) A multimodal biomedical foundation model trained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915.
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with bert. In International Conference on Learning Representations.

Appendix

Appendix A Reproducibility Details

A.1 Dataset Composition

Table 4 provides the detailed dataset composition used in our mirage detection experiments. We evaluate across five visually diverse VQA domains: Chest VQA, PathVQA, TextVQA, DocVQA, and InfoVQA. For each domain, examples are organized into three input conditions: Related, where the image is semantically matched to the question; Unrelated-Real, where the image is visually valid but does not contain evidence relevant to the question; and Blank/Noise, where the visual input is non-informative. Unrelated-Real images are sampled cross-domain: for a question from domain $d$, the paired unrelated image is drawn uniformly at random from a different domain (e.g., a natural scene image paired with a medical question), ensuring visual plausibility without semantic relevance. This construction allows us to evaluate not only whether the detector preserves answerability for genuinely related image–question pairs, but also whether it can reject both low-level invalid inputs and harder semantic mismatches. The training set contains 100 samples per domain per condition (300 per domain), while the remaining examples are used for held-out evaluation.

Domain	Train	Test	Total
Chest VQA	300	1,053	1,353
PathVQA	300	$\sim$ 2,700	$\sim$ 3,000
TextVQA	300	$\sim$ 2,700	$\sim$ 3,000
DocVQA	300	$\sim$ 2,700	$\sim$ 3,000
InfoVQA	300	3,300	3,600
Total	1,500	12,453	13,953

Table 4: Dataset composition across domains and splits. Each domain contains Related, Unrelated-Real, and Blank/Noise conditions.
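The cross-domain sampling of Unrelated-Real images can be illustrated with a minimal sketch. It assumes that "uniformly at random from a different domain" means first choosing one of the other domains uniformly and then choosing an image within it; the function name, the domain keys, and the image identifiers are hypothetical and are not taken from the released code.

```python
import random

DOMAINS = ["chest_vqa", "pathvqa", "textvqa", "docvqa", "infovqa"]


def sample_unrelated_image(question_domain, images_by_domain, rng):
    """Return (source_domain, image_id) with source_domain != question_domain."""
    # Assumption: pick the source domain uniformly, then an image uniformly.
    other_domains = sorted(d for d in images_by_domain if d != question_domain)
    if not other_domains:
        raise ValueError("at least one other domain is required")
    source_domain = rng.choice(other_domains)
    return source_domain, rng.choice(images_by_domain[source_domain])


# Toy image pool with made-up identifiers, seeded as in the paper (seed 42).
rng = random.Random(42)
pool = {d: [f"{d}_img_{i}.png" for i in range(3)] for d in DOMAINS}
pairs = [sample_unrelated_image("chest_vqa", pool, rng) for _ in range(100)]
```

Every sampled pair comes from a domain other than the question's domain, which is the property the Unrelated-Real condition relies on.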

A.2 Implementation Summary

The full system uses Python, PyTorch Paszke et al. (2019), OpenCLIP Ilharco et al. (2021), scikit-learn Pedregosa et al. (2011), XGBoost Chen and Guestrin (2016), and LightGBM Ke et al. (2017). All experiments use a fixed random seed of 42. Our anonymized code is provided here: https://anonymous.4open.science/r/Mirage_Detection_in_VLMS-779D/.

TC-LIA feature extraction.

TC-LIA registers forward hooks on all 32 transformer blocks of a frozen ViT-H/14 CLIP encoder. For each non-blank image–question pair, the image is passed through the vision encoder; intermediate patch tokens are projected into the shared CLIP embedding space via the model’s ln_post and projection matrix, and cosine similarity is computed against the text embedding. At each layer, the mean cosine similarity of the top- $k$ ( $k{=}10$ ) patch tokens is recorded, yielding a 32-point top- $k$ curve. The scalar feature internal_alignment_score is then computed as a fixed linear combination:

$$\mathrm{IAS}=0.50\,f_{\cos}+0.25\,\bar{s}_{\mathrm{late}}+0.15\,\Delta_{\mathrm{gain}}+0.10\,\beta_{\mathrm{slope}}\tag{3}$$

where $f_{\cos}$ is the final-layer cosine similarity, $\bar{s}_{\text{late}}$ is the mean of the top- $k$ curve over the second half of layers, $\Delta_{\text{gain}}$ is the difference between late-layer and early-layer means, and $\beta_{\text{slope}}$ is the slope of a linear fit over the full 32-point curve.
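As a minimal sketch of how these scalars follow from a layer-wise top-$k$ curve, the code below computes the late mean, gain, slope, and IAS with the fixed weights of Eq. 3. It assumes that the early-layer mean is taken over the first half of the layers and that the slope is measured per layer index; neither detail is stated explicitly above, and the function names are illustrative.

```python
def topk_mean(similarities, k=10):
    """Mean of the k largest patch-text cosine similarities at one layer."""
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / len(top)


def tc_lia_features(topk_curve, final_cos):
    """Scalar TC-LIA features from a per-layer top-k curve (needs >= 2 layers)."""
    n = len(topk_curve)
    half = n // 2
    # Assumption: "early" is the first half of layers, "late" the second half.
    early_mean = sum(topk_curve[:half]) / half
    late_mean = sum(topk_curve[half:]) / (n - half)
    gain = late_mean - early_mean
    # Least-squares slope of the curve against the layer index 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(topk_curve) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(topk_curve))
    slope = sxy / sxx
    # Fixed weights from Eq. 3.
    ias = 0.50 * final_cos + 0.25 * late_mean + 0.15 * gain + 0.10 * slope
    return {
        "final_cos": final_cos,
        "late_patch_topk_mean": late_mean,
        "gain_patch_topk_late_minus_early": gain,
        "slope_patch_topk": slope,
        "internal_alignment_score": ias,
    }


# Synthetic curves (made-up values): one flat, one rising in the late layers.
flat = tc_lia_features([0.12] * 32, final_cos=0.20)
rising = tc_lia_features(
    [0.12] * 16 + [0.12 + 0.005 * i for i in range(16)], final_cos=0.30
)
```

On the flat curve the gain and slope vanish and the IAS reduces to the final-cosine and late-mean terms, whereas the rising curve receives positive gain and slope contributions.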

Domain classification uses a zero-shot CLIP prompt bank (16 medical templates and 13 natural-scene templates) with a softmax temperature of 0.07. A domain is labelled medical if the aggregated medical probability exceeds 0.65, natural if the natural probability exceeds 0.65, and mixed otherwise. Blank/noise images bypass the cosine pipeline entirely; their features are derived from grayscale pixel statistics (image_is_blank, image_std).
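The routing rule can be sketched as follows. The sketch assumes that the "aggregated" probability is the total softmax mass over each template family and that the temperature divides the image–prompt cosine similarities; both are assumptions, since the aggregation is not spelled out above, and the function name is hypothetical.

```python
import math


def route_domain(medical_sims, natural_sims, temperature=0.07, threshold=0.65):
    """Return (label, p_medical, p_natural) from image-prompt similarities."""
    logits = [s / temperature for s in list(medical_sims) + list(natural_sims)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    # Assumption: aggregate by summing softmax mass over each template family.
    p_medical = sum(exps[: len(medical_sims)]) / total
    p_natural = 1.0 - p_medical
    if p_medical > threshold:
        label = "medical"
    elif p_natural > threshold:
        label = "natural"
    else:
        label = "mixed"
    return label, p_medical, p_natural


# 16 medical and 13 natural templates, with made-up similarity values.
medical_case = route_domain([0.30] * 16, [0.10] * 13)
natural_case = route_domain([0.10] * 16, [0.30] * 13)
mixed_case = route_domain([0.20] * 16, [0.20] * 13)
```

Under this summation assumption, equal similarities across all templates give a medical probability of 16/29, which falls below 0.65 and is therefore labelled mixed.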

Dataset split.

Each of the five domains (ChestVQA, PathVQA, TextVQA, DocVQA, InfoVQA) contributes 100 training samples per image type (300 per domain, matching Table 4), drawn uniformly at random (seed 42); all remaining samples form the held-out test set. The three image types are Related, Unrelated-Real, and Blank/Noise.

Ensemble classifiers and hyperparameters.

For each VLM backbone, five classifiers are trained independently on the 11 ensemble features (pixel, routing, TC-LIA, and VLM-derived):

• XGBoost: 300 estimators, max depth 6, learning rate 0.05, subsample 0.8, column subsample 0.8, log-loss evaluation metric.
• LightGBM: 300 estimators, max depth 6, learning rate 0.05, subsample 0.8, column subsample 0.8.
• Gradient Boosting: 200 estimators, max depth 5, learning rate 0.05, subsample 0.8.
• AdaBoost: 200 estimators, learning rate 0.5, base learner is a shallow decision tree with max depth 3.
• Random Forest: 300 estimators, max depth 8.

Blank images are classified deterministically as BLANK_OR_NOISE without invoking any classifier; the ensemble operates only on non-blank samples. Model selection is performed via 5-fold stratified cross-validation (StratifiedKFold, shuffle=True, seed 42) on the training split; the classifier with the highest mean CV accuracy is selected as the best ensemble for each VLM backbone.

A.3 Structured VLM Prompt

For the structured VLM output, we used a deterministic structured output prompt as shown in Fig. 8.

You are a visual analysis assistant. Examine the image carefully, then respond using EXACTLY this format:
IMAGE_CLASS: <RELATED | UNRELATED_REAL | BLANK_OR_NOISE>
ANSWER: <your response>
Classification rules:
• RELATED $\rightarrow$ real image containing evidence relevant to the question
• UNRELATED_REAL $\rightarrow$ real image whose content does NOT match the question topic
• BLANK_OR_NOISE $\rightarrow$ blank (black/white/grey/uniform) or random pixel noise
Answer rules:
• If RELATED $\rightarrow$ answer directly from what you see; multiple sentences are fine if the question requires detail
• If UNRELATED_REAL or BLANK_OR_NOISE $\rightarrow$ write exactly: “I cannot answer based on the given image.”
Question: {question}

Figure 8: Structured deterministic prompt used for VLM evaluation.
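A response in this format can be turned into the class label and candidate answer with a small parser. The sketch below is illustrative only: the function name is hypothetical, and treating a malformed response as having no class label is an assumption rather than documented behavior of the pipeline.

```python
import re

VALID_CLASSES = {"RELATED", "UNRELATED_REAL", "BLANK_OR_NOISE"}


def parse_structured_response(text):
    """Extract (image_class, answer) from a response in the Fig. 8 format."""
    class_match = re.search(r"IMAGE_CLASS:\s*<?\s*([A-Z_]+)", text)
    answer_match = re.search(r"ANSWER:\s*(.*)", text, flags=re.DOTALL)
    image_class = class_match.group(1) if class_match else None
    if image_class not in VALID_CLASSES:
        image_class = None  # assumption: unknown or missing labels become None
    answer = answer_match.group(1).strip() if answer_match else ""
    return image_class, answer


ok = parse_structured_response(
    "IMAGE_CLASS: RELATED\nANSWER: The left lung shows an opacity."
)
refused = parse_structured_response(
    "IMAGE_CLASS: <UNRELATED_REAL>\n"
    "ANSWER: I cannot answer based on the given image."
)
malformed = parse_structured_response("The image shows a cat.")
```

The parsed class label can then be encoded as a categorical feature for the ensemble, while the answer is held back until the final decision is made.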

A.4 Algorithm

Algorithm 1 summarizes the runtime inference procedure for the proposed mirage detection framework. Given an image–question pair, the system first applies a lightweight blank/noise screen. Non-blank inputs are then processed through CLIP-based domain routing and TC-LIA feature extraction. The VLM is queried only for a structured preliminary class label and candidate answer, and the final decision is made by the ensemble classifier using pixel, routing, alignment, and VLM-derived features. If the ensemble predicts Related, the VLM answer is released; otherwise, the system returns a refusal response.

Input: Image $x$, question $q$, VLM $M$, CLIP encoder $C$, trained ensemble $E$
Output: Decision $y\in\{\textsc{Related},\textsc{Unrelated-Real},\textsc{Blank/Noise}\}$ and released response

Compute pixel statistics $b(x)$
if $b(x)$ indicates blank/noise with high confidence then
    return Blank/Noise and refusal response
Compute domain probabilities $(p_{m},p_{n})$ using CLIP prompt routing
Extract layer-wise patch tokens $z_{\ell,i}$ from the CLIP ViT encoder
Compute patch–text similarities $s_{\ell,i}\leftarrow\cos(Pz_{\ell,i},t(q))$
Compute TC-LIA features: final cosine, late top-$k$, gain, slope, and IAS
Query the VLM for a structured class label and candidate answer
Fuse pixel, routing, TC-LIA, and VLM features using ensemble $E$
if $E(x,q)=\textsc{Related}$ then
    return Related and the VLM answer
else
    return predicted non-related class and refusal response

Algorithm 1 Runtime Mirage Detection with TC-LIA
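The control flow of Algorithm 1 can be sketched with the individual stages passed in as callables. All function and parameter names here are hypothetical placeholders for the real components (pixel screen, CLIP routing, TC-LIA extraction, VLM query, trained ensemble); the sketch shows only the order of operations and the release rule.

```python
REFUSAL = "I cannot answer based on the given image."


def detect_then_answer(image, question, is_blank, route, tc_lia, query_vlm, ensemble):
    """Return (decision, released_response) following the flow of Algorithm 1."""
    # Stage 1: blank/noise inputs are refused without invoking the ensemble.
    if is_blank(image):
        return "BLANK_OR_NOISE", REFUSAL
    # Routing probabilities and TC-LIA scalars become ensemble features.
    p_medical, p_natural = route(image)
    features = dict(tc_lia(image, question))
    # The VLM provides a preliminary class label and a candidate answer.
    vlm_class, candidate_answer = query_vlm(image, question)
    features.update(p_medical=p_medical, p_natural=p_natural, vlm_class=vlm_class)
    decision = ensemble(features)
    # The candidate answer is released only when the ensemble predicts RELATED.
    if decision == "RELATED":
        return decision, candidate_answer
    return decision, REFUSAL


# Stub components with made-up behavior, for illustration only.
stubs = dict(
    route=lambda x: (0.9, 0.1),
    tc_lia=lambda x, q: {"internal_alignment_score": 0.3},
    query_vlm=lambda x, q: ("RELATED", "A fracture is visible."),
    ensemble=lambda f: "RELATED"
    if f["internal_alignment_score"] > 0.2
    else "UNRELATED_REAL",
)
released = detect_then_answer("img", "q", is_blank=lambda x: False, **stubs)
blank = detect_then_answer("img", "q", is_blank=lambda x: True, **stubs)
```

Passing the stages as arguments keeps the sketch independent of any particular VLM or encoder, in line with the model-agnostic framing of the method.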

Appendix B Additional TC-LIA Analyses

This section provides additional analyses that are useful for interpreting the main results but are placed in the appendix to preserve space in the eight-page paper. These analyses explain why the five scalar TC-LIA features were chosen, how domain routing affects final cosine similarity, and how the detector affects answer quality on genuinely related inputs.

B.1 Rationale for the Five TC-LIA Features

The five scalar TC-LIA features (Table 5) were selected from thirteen candidate features by evaluating each candidate’s ability to separate related from unrelated image–question pairs using AUROC on a controlled 200-pair dataset (100 related, 100 unrelated, balanced across natural and medical domains). The thirteen candidates spanned three families: CLS-based similarity (late_cls_sim_mean, slope_cls_sim), patch top- $k$ alignment (late_patch_topk_mean, gain_patch_topk_late_minus_early, slope_patch_topk), and attention-weighted patch alignment (late_attn_weighted_sim_mean, gain_attn_weighted_late_minus_early, slope_attn_weighted, late_attn_topk_semantic_mean).
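Ranking candidates by AUROC can be sketched as below. The AUROC is computed directly as the probability that a related pair scores higher than an unrelated pair, with ties counted as one half; the feature values are made up for illustration and do not come from the 200-pair dataset.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rank_candidates(candidates, labels):
    """Sort candidate features by AUROC; labels are 1 (related) or 0 (unrelated)."""
    ranked = []
    for name, values in candidates.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        ranked.append((name, auroc(pos, neg)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy data (made-up values): three related and three unrelated pairs.
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    "late_patch_topk_mean": [0.21, 0.19, 0.20, 0.12, 0.13, 0.14],
    "slope_cls_sim": [0.02, 0.01, 0.03, 0.02, 0.03, 0.01],
}
ranking = rank_candidates(candidates, labels)
```

A feature that separates the two groups perfectly reaches an AUROC of 1.0, while one whose values are distributed identically in both groups sits at 0.5.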

CLS-based features were dropped because the CLS token aggregates global image information at the final layer and is therefore highly correlated with final_cos; adding them introduced redundancy without improving separability. Attention-weighted features were dropped because they inherit the same failure mode as raw RAPT: the encoder’s self-attention is not conditioned on the question text, so attention-weighted similarity reflects visual salience rather than question relevance. Patch top- $k$ features avoided this problem by selecting the patches with the highest cosine similarity to the text embedding directly, bypassing attention weights entirely.

The five retained features capture complementary and non-redundant evidence signals. final_cos measures global image–text compatibility as a strong baseline. late_patch_topk_mean measures whether localised question-relevant evidence appears in the deeper visual layers. gain_patch_topk_late_minus_early measures whether that evidence strengthens from early to late layers, suppressing layer-invariant shortcuts. slope_patch_topk summarises the global trajectory of alignment across all 32 layers. internal_alignment_score (IAS) combines these signals into a single compact scalar score:

$$\text{IAS}=0.50\cdot f_{\cos}+0.25\cdot\bar{s}_{\text{late}}+0.15\cdot\Delta_{\text{gain}}+0.10\cdot\beta_{\text{slope}}.\tag{4}$$

An internal-only logistic-regression probe trained exclusively on the patch top- $k$ features (excluding final_cos) achieved AUROC above 0.90, confirming that the layerwise trajectory carries separable information beyond global cosine similarity alone.

Feature	What it measures	Why it helps
final_cos	final global CLIP similarity	strong baseline relevance signal
late_patch_topk_mean	late local patch evidence	captures localized support
gain	late minus early alignment	suppresses global shortcuts
slope	trend across layers	captures evidence emergence
IAS	weighted composite	compact scalar decision score

Table 5: Interpretation of the five TC-LIA features used by the ensemble.

B.2 Handling the No-Image Case

The pipeline may receive no image at all, for example when an API call omits the image field or when the image file fails to load. Rather than introducing a separate detection branch, we propose a simple unified treatment: define a zero tensor $\mathbf{0}$ of the expected spatial dimensions and compute $\tilde{x}=x\oplus\mathbf{0}$, where $\oplus$ denotes element-wise addition. For an informative image $x$, this operation is a no-op and the image passes through the pipeline unchanged. When no image is provided, $x$ itself is treated as $\mathbf{0}$, yielding an all-black representation whose pixel statistics (near-zero grayscale standard deviation and zero patch-variance coefficient of variation) are immediately flagged by the Stage 1 blank/noise gate, triggering a refusal response. The no-image case is thus subsumed by the existing blank/noise safeguard and requires no additional detection module.
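A minimal sketch of this fallback is shown below, using nested lists as a stand-in for a grayscale image. The function names and the standard-deviation threshold of 1.0 are hypothetical; the actual Stage 1 gate uses additional statistics that are not reproduced here, and this check covers only the blank case, not random noise.

```python
from statistics import pstdev


def with_zero_fallback(image, height, width):
    """Return the image unchanged, or an all-zero image when none is supplied."""
    if image is None:
        return [[0.0] * width for _ in range(height)]
    return image


def is_blank(gray_image, std_threshold=1.0):
    """Flag images whose grayscale standard deviation is near zero."""
    # Assumption: the threshold value is illustrative, not the paper's setting.
    pixels = [p for row in gray_image for p in row]
    return pstdev(pixels) < std_threshold


missing = with_zero_fallback(None, height=4, width=4)
textured = with_zero_fallback([[0.0, 255.0], [255.0, 0.0]], height=2, width=2)
```

The missing image becomes an all-zero array that the same blank check rejects, so no separate code path is required for absent inputs.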

B.3 TC-LIA Layer Projection Calibration

Figure 9 directly validates the two core design assumptions of TC-LIA: that the fixed final-layer projection $W_{v}$ provides a content-neutral readout of intermediate tokens, and that the discriminative signal is concentrated in the late vision-encoder layers.

Left panel: Layer-wise alignment trajectory.

The mean top- $k$ patch–text cosine similarity is plotted at each ViT-H/14 layer for Related (blue) and Unrelated (red) pairs pooled across all five evaluation domains. In early layers (0–14), both conditions produce nearly identical alignment values ( $\approx$ 0.11–0.12), providing no discriminative signal. In late layers (15–31), the Related curve rises steadily to $\approx$ 0.20–0.21, while the Unrelated curve remains flat at $\approx$ 0.12–0.14. The separation emerges precisely in the TC-LIA focus window (shaded region, layers 16–31), directly justifying the use of late_patch_topk_mean as a discriminative feature. A random-vector baseline ( $\approx$ 0.05, flat throughout) confirms that the observed separation is semantic rather than a geometric artifact of the $W_{v}$ projection.

Right panel: CKA validity check.

To verify that $W_{v}$ does not favor one image type over the other when projecting intermediate tokens, we compute Centered Kernel Alignment (CKA) between the projected intermediate representations and the final-layer representations, separately for Related and Unrelated pairs. Both curves are essentially identical, rising from $\approx$ 0.40 at layer 0 to $\approx$ 0.98 at layer 31, confirming that $W_{v}$ acts as a content-neutral semantic ruler: it does not introduce a systematic bias toward either condition, so any alignment difference in the left panel is purely semantic. Together, these results validate two properties required for TC-LIA: the $W_{v}$ approximation is content-neutral and geometrically consistent across layers, and the discriminative signal is concentrated in the late layers, motivating the late-layer focus of the TC-LIA feature set.
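For readers unfamiliar with CKA, the sketch below computes the linear variant between two sets of representations (one row per sample). It assumes linear CKA; the text above does not state which kernel was used, so this should be read as an illustration of the measure rather than the exact evaluation code.

```python
import math


def _centered_gram(rows):
    """Gram matrix of the rows after subtracting each feature's mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]


def _frobenius_inner(a, b):
    return sum(p * q for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


def linear_cka(x_rows, y_rows):
    """Linear CKA between two representations of the same n samples."""
    k, l = _centered_gram(x_rows), _centered_gram(y_rows)
    return _frobenius_inner(k, l) / math.sqrt(
        _frobenius_inner(k, k) * _frobenius_inner(l, l)
    )


# Made-up representations of four samples.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [3.0, 5.0]]
x_scaled = [[2.0 * v for v in row] for row in x]
y = [[1.0], [-1.0], [1.0], [-1.0]]
same = linear_cka(x, x_scaled)
different = linear_cka(x, y)
```

Linear CKA is invariant to isotropic scaling, so a representation and its rescaled copy score approximately 1, while structurally different representations score lower.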

B.4 Representational Similarity Analysis: Late Layers Develop More Structured Representations

Figure 10 provides complementary evidence for the TC-LIA hypothesis through Representational Similarity Analysis (RSA). For each ViT-H/14 layer, we compute the Spearman correlation between the pairwise distance matrix of the projected patch representations and that of the final-layer representations (RSA to final-layer RDM), separately for Related and Unrelated pairs.

Left panel: Overall RSA trajectory.

Both curves converge to RSA $=1.0$ at layer 31 by construction. Across layers 5–25, the Unrelated curve (red) lies above the Related curve (blue): unrelated images reach their final representational structure earlier, requiring less late-layer processing. Related images continue to be refined in late layers, indicating that the encoder performs more semantic computation on them – consistent with the emergence of question-relevant evidence captured by TC-LIA.

Right panel: Per-domain late-layer RSA.

The mean late-layer RSA (layers 16–31) is reported per domain. Medical domains (chest_vqa, $\Delta=+0.093$ ; pathvqa, $\Delta\approx 0$ ) show the sharpest domain-level contrast. textvqa shows a negative gap ( $\Delta=-0.301$ ), as same-domain text-rich images share structural properties regardless of semantic relevance, consistent with TC-LIA’s reduced discriminative power in document-style domains. This per-domain variability motivates ensemble fusion rather than a single universal IAS threshold. Taken together, RSA provides convergent evidence at the representational level that related pairs undergo more active late-layer transformation than unrelated ones, confirming that the TC-LIA alignment rise reflects genuine semantic processing rather than a projection artifact.

B.5 In-Domain Hard Negatives: Scope of TC-LIA and Motivation for Ensemble Fusion

A key question for mirage detection is whether TC-LIA can distinguish a related image from an in-domain hard negative – an unrelated image drawn from the same visual domain (e.g., a different chest X-ray paired with a chest VQA question). Figure 11 reports a direct evaluation of this question across all five domains.

Top histograms: IAS distributions.

For all five domains, cross-domain negatives (red) are clearly left-shifted relative to Related (blue): IAS separates them well. In-domain hard negatives (orange), however, overlap substantially with Related, and for some domains (chest_vqa, pathvqa, docvqa), the orange distribution is right-shifted past the related distribution. Images from the same domain share domain-level visual-text alignment with the question by virtue of domain priors alone, preventing IAS from distinguishing them.

Bottom chart: AUROC by negative type.

TC-LIA IAS achieves strong AUROC against cross-domain negatives (range $0.903$ – $0.987$ , mean $\approx 0.928$ ) but falls to near-chance in 4 of 5 domains against in-domain hard negatives ( $0.451$ – $0.555$ ), with the exception of infovqa ( $0.772$ ). This confirms that TC-LIA captures domain-level visual–text alignment rather than instance-level semantic matching.

Why this limitation motivates the ensemble.

This result delineates the scope of TC-LIA rather than undermining it, and it directly motivates the full pipeline design. TC-LIA provides a powerful, efficient signal for the common cross-domain mismatch case. The VLM structured self-assessment (Stage 4) covers the harder in-domain cases, where semantic mismatches cannot be resolved from patch-level alignment alone. The complementary coverage of each component is why the full 11-feature ensemble (Table 3) substantially outperforms TC-LIA IAS alone and why removing the VLM class feature causes the largest single-component accuracy drop.

It is important to note that in-domain hard negatives are not inherently mirage cases in the traditional sense. Unlike cross-domain unrelated inputs, they are valid images from the correct visual domain; a model may still produce a plausible answer not by hallucinating from nothing, but by relying on domain-level priors rather than the specific image content. Such cases are better characterised as domain-prior-driven wrong answers rather than visually ungrounded hallucinations, and they represent a distinct and harder failure mode that motivates complementary semantic filtering via structured VLM self-assessment.

B.6 Encoder Generalization: TC-LIA Across Visual Backbones

TC-LIA’s design rests on two structural properties of the visual encoder: softmax contrastive training that aligns patch tokens with text embeddings, and a $W_{v}$ projection that faithfully maps intermediate tokens into the joint CLIP space. Figure 12 tests both properties by comparing the late-layer alignment gap ( $\Delta=\mu_{\text{related}}-\mu_{\text{unrelated}}$ over the final 20% of layers) across five encoders spanning different training objectives and architectures.

• CLIP ViT-H/14 (laion2b), $\Delta=+0.0464$. Strongest signal. Related alignment rises sharply after normalized depth $\approx$ 0.6 while Unrelated remains flat, producing the widening late-layer gap across samples. This result justifies CLIP ViT-H/14 as the primary TC-LIA backbone.
• MetaCLIP ViT-H-14, $\Delta=+0.0180$. Positive but weaker signal; separation appears only in the final 20% of layers. MetaCLIP uses the same softmax contrastive loss as CLIP on curated data: the same training objective reproduces the late-layer phenomenon, albeit attenuated.
• SigLIP SO400M, $\Delta=+0.0098$. Marginal signal despite being a large contrastively trained model. SigLIP uses sigmoid binary loss rather than softmax, which fundamentally changes how patch tokens align to text and suppresses the late-layer amplification TC-LIA depends on.
• DINOv2-Large, $\Delta=-0.0021$. Negative control. Both curves are nearly identical throughout all layers. DINOv2 has no text supervision; the random projection into text space produces noise. The near-zero gap confirms that TC-LIA requires text-visual co-training.
• EVA-CLIP EVA02-L-14, $\Delta=-0.0012$. Despite contrastive training, near-zero signal with very wide confidence intervals. The $W_{v}$ approximation is likely lossy for EVA-CLIP’s TimmModel architecture, where intermediate token geometry may not map cleanly into the text embedding space via ln_post and proj.

These results suggest that TC-LIA’s signal is gated on softmax contrastive co-training (CLIP-style loss): the method generalizes to MetaCLIP ( $\Delta=+0.0180$ ) but degrades with sigmoid loss (SigLIP) or without text supervision (DINOv2), while EVA-CLIP’s architecture likely breaks the $W_{v}$ projection assumption. CLIP ViT-H/14 (laion2b) yields the largest late-layer gap among the five encoders compared, which supports its use as the primary backbone.
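The late-layer gap $\Delta$ used in this comparison can be sketched as below. The sketch assumes that the number of late layers is rounded up and that curves are averaged per sample before averaging across samples; both conventions are assumptions, and the curves are synthetic.

```python
import math


def late_layer_gap(related_curves, unrelated_curves, late_fraction=0.20):
    """Mean late-layer alignment of related minus unrelated pairs."""

    def late_mean(curve):
        # Assumption: round the number of late layers up, with at least one.
        n_late = max(1, math.ceil(late_fraction * len(curve)))
        tail = curve[-n_late:]
        return sum(tail) / len(tail)

    mu_related = sum(late_mean(c) for c in related_curves) / len(related_curves)
    mu_unrelated = sum(late_mean(c) for c in unrelated_curves) / len(unrelated_curves)
    return mu_related - mu_unrelated


# Synthetic 10-layer curves (made-up values).
related = [[0.10] * 8 + [0.20, 0.20]]
unrelated = [[0.10] * 10]
gap = late_layer_gap(related, unrelated)
no_gap = late_layer_gap(unrelated, unrelated)
```

Restricting the average to the final 20% of layers makes the statistic comparable across encoders of different depth.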

B.7 Domain-wise TC-LIA Score and IAS Distributions

Figure 13 provides a domain-wise view of the TC-LIA alignment trajectory. Across all five domains, Related examples generally show increasing top- $k$ patch–text similarity in later layers, consistent with the emergence of question-relevant visual evidence. In contrast, Unrelated-Real inputs remain flatter and lower, indicating weaker semantic correspondence between the question and image. Blank/Noise inputs often show unstable or non-semantic intermediate behavior, but lack the sustained late-layer rise observed for related pairs. These patterns support the use of late alignment, gain, and slope as complementary TC-LIA features.

Figure 14 expands the aggregate IAS distribution in Fig. 3 (b) by showing each domain separately. The main pattern is consistent across datasets: Related examples tend to shift toward larger IAS values, Unrelated-Real examples occupy lower values, and Blank/Noise examples concentrate in a narrower intermediate range. The degree of separation varies by domain, motivating the use of ensemble fusion rather than a single universal IAS threshold.

B.8 Domain-Adaptive Routing

Domain-adaptive routing improves medical-domain cosine similarity by switching to BioMedCLIP when the image is confidently medical, but it can slightly degrade document or natural-image domains when mixed-content images are routed imperfectly. We therefore use routing probabilities and the encoded routing decision as features rather than treating routing as a hard final decision.

Domain	ViT-H/14	Adaptive	$\Delta$
Chest VQA	0.933	0.978	+0.044
PathVQA	0.835	0.882	+0.047
TextVQA	0.963	0.931	-0.032
DocVQA	0.886	0.869	-0.018
InfoVQA	0.979	0.972	-0.007

Table 6: Domain-adaptive final-cosine AUROC. Medical domains benefit from BioMedCLIP routing, whereas document/natural domains can be better served by the general ViT-H/14 embedding space.

B.9 Classifier Comparison

Table 7 compares ensemble choices. XGBoost gives the best accuracy–mirage tradeoff, while Gradient Boosting, LightGBM, AdaBoost, and Random Forest remain close. The rule-based baseline is useful as a sanity check but is weaker and was evaluated in-sample, so it should not be treated as the main comparator.

Classifier	Acc $\uparrow$	Mirage Rate $\downarrow$	Macro-F1 $\uparrow$
XGBoost	94.1	2.4	0.941
GradientBoosting	93.9	2.6	0.939
LightGBM	93.8	2.6	0.938
AdaBoost	93.7	2.6	0.937
RandomForest	93.3	2.6	0.933
Rule-based fusion	91.1	5.1	0.911

Table 7: Classifier comparison. Boosting-based ensembles dominate the rule-based baseline; the rule-based result is in-sample and should be interpreted as an optimistic baseline.

B.10 Answer Quality

Figure 15 reports answer quality across all three input conditions for both the base-prompt and ensemble-filtered settings. Each row corresponds to a condition among Related, Unrelated-Real, and Blank/Noise and each column reports a different metric: BLEU, ROUGE-L, and BERTScore F1. For Related inputs, answer quality is measured against ground-truth answers; for Unrelated-Real and Blank/Noise inputs, it is measured against a set of six canonical refusal phrases, so a higher score means the system correctly refused. TC-LIA + Ensemble matches or improves the base prompt in every cell of the $3\times 3$ grid. On Related inputs, ensemble filtering preserves answer quality, confirming that mirage suppression does not collapse into indiscriminate refusal. On Unrelated-Real and Blank/Noise inputs, the ensemble substantially raises refusal-match scores across all twelve VLM backbones, demonstrating that the detector reliably withholds responses when visual evidence is absent or mismatched.

B.11 Feature Importance

Feature importance is reported only as a diagnostic because tree-based importances are not causal explanations. Nevertheless, the rankings provide useful evidence that TC-LIA contributes signal beyond what either final cosine similarity or VLM self-assessment alone could provide.

Figure 16 shows XGBoost and LightGBM importances aggregated over Qwen2.5-VL-7B. In XGBoost (left), internal_alignment_score is the single most important feature by a wide margin (importance $\approx$ 0.32), followed by vlm_class_enc ( $\approx$ 0.17) and final_cos ( $\approx$ 0.15). LightGBM (right) shows a more distributed ranking: final_cos, gain_patch_topk_late_minus_early, internal_alignment_score, and slope_patch_topk are roughly tied at the top, with domain-routing features (pm, pn) contributing moderately. The two classifiers agree that the composite IAS, final cosine, and the gain/slope trajectory are the most informative TC-LIA signals, while blank-gate features (s1_is_blank) rank near the bottom because blank inputs are trivially handled before the ensemble is invoked.

Figure 39 breaks down XGBoost importance across five representative VLM backbones. The pattern is consistent: IAS and final_cos are top-two features for BLIP2-2.7B, Gemma-3-4B, Phi-3.5-Vision, and LLaVA-OV-7B. The notable exception is Qwen2.5-VL-7B, where vlm_class_enc dominates ( $\approx$ 0.47), indicating that Qwen’s structured responses are unusually discriminative and the ensemble relies heavily on self-assessment for that backbone. Across all five models, no single feature suffices alone, confirming that ensemble fusion over complementary signals is necessary for robust detection.

Appendix C Additional Ablation Studies

This section presents the complete per-ablation figures corresponding to the ablation study summarised in Table 3 of the main paper.

C.1 Leave-One-Domain-Out Generalization

We train the TC-LIA ensemble on four domains and evaluate on the held-out fifth, using Gemma-3-4B-IT as the VLM backbone. Figure 17 shows that 3-class accuracy ranges from 72.0% (PathVQA) to 96.3% (InfoVQA), and mirage rate stays between 1.4% and 6.7% across all held-out domains. ChestVQA has the highest held-out mirage rate (6.7%), which we attribute to its distinctive medical visual distribution. These results indicate that TC-LIA features transfer across domains without domain-specific fine-tuning, although accuracy drops noticeably on the hardest held-out domain (PathVQA).
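The leave-one-domain-out protocol can be sketched as a simple index split. The helper below is a minimal illustration, not the experiment code: `leave_one_group_out` is a hypothetical name, and `DomainD` and `DomainE` are placeholders standing in for domains not named in this subsection.

```python
from typing import Iterator, List, Tuple


def leave_one_group_out(
    groups: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy domain labels; "DomainD" and "DomainE" are placeholders, not paper datasets.
domains = ["PathVQA", "InfoVQA", "ChestVQA", "DomainD", "DomainE"] * 2
splits = list(leave_one_group_out(domains))

for held_out, train_idx, test_idx in splits:
    # A classifier would be fit on train_idx and scored on test_idx here.
    print(held_out, len(train_idx), len(test_idx))
```

The same helper applies to any grouping key, so passing a per-sample backbone name instead of a domain name yields a leave-one-model-out split.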

C.2 Leave-One-VLM-Out Generalization

We pool training data from eight VLMs and evaluate on the held-out ninth, testing two variants: the full 11-feature ensemble (with vlm_class) and the same ensemble with the VLM class encoding removed (no vlm_class). Figure 18 shows that accuracy remains consistently high (85–90%) across all held-out VLMs, confirming that the ensemble generalises across unseen model families. Removing the VLM class feature raises mirage rate noticeably for BLIP2-2.7B and Gemma-3-4B, the two models whose structured outputs differ most from the training pool, demonstrating that vlm_class_enc encodes model-specific response style that aids detection when that style is familiar. The gap is small for the larger models (AyaVision-32B, Qwen2.5-VL-32B), suggesting that TC-LIA features alone are sufficient when the VLM backbone is more capable.

Appendix D Proofs for Theoretical Motivation

This appendix provides full proofs for the theoretical statements in Section 5. These results are intended as formal motivation for the TC-LIA feature design rather than as guarantees for the full nonlinear ensemble classifier.

D.1 Proof of Lemma 1: Late-layer Alignment Separation

Let $a_{L}$ denote the late-layer top-$k$ patch alignment statistic. Assume that $a_{L}\mid R$ and $a_{L}\mid U$ are sub-Gaussian with means $\mu_{R}$ and $\mu_{U}$, respectively, and with a common variance proxy $\sigma^{2}$. Then, for any $t>0$,

\small P(a_{L}-\mu_{R}\leq-t\mid R)\leq\exp\!\left(-\frac{t^{2}}{2\sigma^{2}}\right),

and

\small P(a_{L}-\mu_{U}\geq t\mid U)\leq\exp\!\left(-\frac{t^{2}}{2\sigma^{2}}\right).

Let $\Delta=\mu_{R}-\mu_{U}>0$ and define the decision threshold

\small\tau=\frac{\mu_{R}+\mu_{U}}{2}.

The threshold classifier predicts $\hat{y}=R$ if $a_{L}>\tau$ and predicts $\hat{y}=U$ otherwise.

For a related pair, an error occurs when $a_{L}\leq\tau$ . Therefore,

\small\begin{aligned}
P(\hat{y}\neq R\mid R)&=P(a_{L}\leq\tau\mid R)\\
&=P(a_{L}-\mu_{R}\leq\tau-\mu_{R}\mid R).
\end{aligned}

Since

\small\tau-\mu_{R}=\frac{\mu_{R}+\mu_{U}}{2}-\mu_{R}=-\frac{\Delta}{2},

we have

\small P(\hat{y}\neq R\mid R)=P(a_{L}-\mu_{R}\leq-\Delta/2\mid R).

Applying the sub-Gaussian lower-tail bound with $t=\Delta/2$ gives

\small P(\hat{y}\neq R\mid R)\leq\exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right).

Similarly, for an unrelated pair, an error occurs when $a_{L}>\tau$ . Thus,

\small\begin{aligned}
P(\hat{y}\neq U\mid U)&=P(a_{L}>\tau\mid U)\\
&=P(a_{L}-\mu_{U}>\tau-\mu_{U}\mid U).
\end{aligned}

Since

\small\tau-\mu_{U}=\frac{\mu_{R}+\mu_{U}}{2}-\mu_{U}=\frac{\Delta}{2},

we obtain

\small P(\hat{y}\neq U\mid U)=P(a_{L}-\mu_{U}>\Delta/2\mid U).

Applying the sub-Gaussian upper-tail bound with $t=\Delta/2$ gives

\small P(\hat{y}\neq U\mid U)\leq\exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right).

Therefore, the per-class error of the threshold classifier is bounded by

\small\exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right).

This proves Lemma 1.
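As a numerical sanity check of Lemma 1 (not part of the proof), the sketch below evaluates the special case in which $a_{L}$ is Gaussian within each class, which satisfies the sub-Gaussian assumption with variance proxy $\sigma^{2}$. In that case the per-class error of the midpoint threshold is exactly $\Phi(-\Delta/(2\sigma))$, which can be compared with the bound $\exp(-\Delta^{2}/(8\sigma^{2}))$. The function names and the chosen values of $\Delta$ and $\sigma$ are illustrative assumptions.

```python
import math


def gaussian_threshold_error(delta: float, sigma: float) -> float:
    """Exact per-class error of the midpoint threshold for Gaussian scores.

    For a_L | R ~ N(mu_R, sigma^2), P(a_L <= tau | R) = Phi(-delta / (2 sigma)).
    """
    z = delta / (2.0 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lemma1_bound(delta: float, sigma: float) -> float:
    """Sub-Gaussian bound exp(-delta^2 / (8 sigma^2)) from Lemma 1."""
    return math.exp(-(delta ** 2) / (8.0 * sigma ** 2))


# Illustrative gaps between class means, with a fixed sigma of 0.1.
for delta in (0.05, 0.1, 0.2, 0.4):
    err = gaussian_threshold_error(delta, sigma=0.1)
    bound = lemma1_bound(delta, sigma=0.1)
    print(f"delta={delta:.2f}  exact error={err:.4f}  bound={bound:.4f}")
```

Both quantities shrink as the gap $\Delta$ grows relative to $\sigma$, and the exact Gaussian error stays below the bound, as the lemma requires.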

D.2 Proof of Lemma 2: Gain Cancels Layer-invariant Shortcuts

Assume that the layer-wise alignment statistic can be written as

\small a_{\ell}(x,q)=c(x,q)+r_{\ell}(x,q)+\epsilon_{\ell},

where $c(x,q)$ is a layer-invariant global image–text prior, $r_{\ell}(x,q)$ is a localized evidence term that varies across layers, and $\epsilon_{\ell}$ is noise. Let $\mathrm{early}$ and $\mathrm{late}$ denote the average alignment over the early and late layer sets:

\small\mathrm{early}=\frac{1}{|\mathcal{L}_{E}|}\sum_{\ell\in\mathcal{L}_{E}}a_{\ell},\qquad\mathrm{late}=\frac{1}{|\mathcal{L}_{L}|}\sum_{\ell\in\mathcal{L}_{L}}a_{\ell}.

Substituting the decomposition of $a_{\ell}$ gives

\small\mathrm{early}=c(x,q)+r_{E}+\epsilon_{E},

and

\small\mathrm{late}=c(x,q)+r_{L}+\epsilon_{L},

where $r_{E}$ and $r_{L}$ are the average localized evidence terms in the early and late layers, and $\epsilon_{E}$ and $\epsilon_{L}$ are the corresponding average noise terms.

The gain feature is

\small\begin{aligned}
\mathrm{gain}&=\mathrm{late}-\mathrm{early}\\
&=\left(c(x,q)+r_{L}+\epsilon_{L}\right)-\left(c(x,q)+r_{E}+\epsilon_{E}\right)\\
&=r_{L}-r_{E}+\epsilon_{L}-\epsilon_{E}.
\end{aligned}

Thus, the layer-invariant term $c(x,q)$ cancels exactly. Consequently, the gain feature suppresses global similarity shortcuts that persist across layers and instead emphasizes the emergence of localized evidence from early to late layers. This proves Lemma 2.
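The cancellation can be illustrated with a toy trajectory. In the sketch below, the evidence values $r_{\ell}$, the two prior values $c$, and the early and late layer sets are made-up numbers chosen for illustration, and the noise term is omitted; `gain` and `with_prior` are hypothetical helper names.

```python
from statistics import mean
from typing import List, Sequence


def gain(alignment: Sequence[float], early: Sequence[int], late: Sequence[int]) -> float:
    """Late-minus-early mean alignment over the given layer index sets."""
    return mean(alignment[i] for i in late) - mean(alignment[i] for i in early)


def with_prior(evidence: Sequence[float], c: float) -> List[float]:
    """a_l = c + r_l: add a layer-invariant prior c to the evidence terms."""
    return [c + r for r in evidence]


# Toy evidence trajectory r_l over 8 layers (illustrative values, noise omitted).
r = [0.00, 0.01, 0.02, 0.03, 0.10, 0.15, 0.20, 0.25]
EARLY, LATE = [0, 1, 2, 3], [4, 5, 6, 7]

gain_low_prior = gain(with_prior(r, c=0.05), EARLY, LATE)
gain_high_prior = gain(with_prior(r, c=0.60), EARLY, LATE)
print(gain_low_prior, gain_high_prior)
```

The two gains agree up to floating-point rounding even though the raw alignment values differ by 0.55 at every layer, which is the sense in which the gain feature ignores a layer-invariant shortcut.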

D.3 Proof of Proposition 1: Staged Blank Gating Decomposes Mirage Risk

Let $B$ denote the event that the input is blank/noise, and let $U$ denote the event that the input is unrelated-real. A mirage error occurs when a non-related input is incorrectly passed as related. Therefore,

\small\mathrm{MR}(g)=P(g(x,q)=R,\;Y\in\{B,U\}).

Since $B$ and $U$ are disjoint classes,

\small\begin{aligned}
\mathrm{MR}(g)&=P(g(x,q)=R,\;Y=B)\\
&\qquad+P(g(x,q)=R,\;Y=U).
\end{aligned}

For a blank/noise input to be passed as related, it must first fail to be rejected by the blank/noise gate. Therefore,

\small P(g(x,q)=R,\;Y=B)\leq P(g_{B}\neq B,\;Y=B).

Using the product rule,

\small P(g_{B}\neq B,\;Y=B)=P(B)P(g_{B}\neq B\mid B).

By definition, $\epsilon_{B}=P(g_{B}\neq B\mid B)$ , so

\small P(g(x,q)=R,\;Y=B)\leq P(B)\epsilon_{B}.

For unrelated-real inputs, the relevant failure mode is that the non-blank related/unrelated detector passes the input as related. Thus,

\small P(g(x,q)=R,\;Y=U)\leq P(g_{N}=R,\;Y=U).

Again applying the product rule,

\small P(g_{N}=R,\;Y=U)=P(U)P(g_{N}=R\mid U).

By definition, $\epsilon_{U}=P(g_{N}=R\mid U)$ , so

\small P(g(x,q)=R,\;Y=U)\leq P(U)\epsilon_{U}.

Combining the two bounds gives

\small\mathrm{MR}(g)\leq P(B)\epsilon_{B}+P(U)\epsilon_{U}.

This proves Proposition 1. The result shows that the overall mirage risk can be reduced by separately controlling blank/noise failures through $g_{B}$ and semantic mismatch failures through $g_{N}$ , which matches the staged design of the proposed detector.
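The decomposition can also be checked on synthetic data. The sketch below simulates a staged detector with a blank gate $g_{B}$ followed by a non-blank detector $g_{N}$ and compares the empirical mirage rate with the empirical value of $P(B)\epsilon_{B}+P(U)\epsilon_{U}$. The class prior, the gate error rates, and all function names are arbitrary assumptions for illustration and are unrelated to the reported results.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, bool, str]  # (true label Y, blank gate fired, g_N output)


def staged_predict(gate_says_blank: bool, g_n: str) -> str:
    """Staged detector g: blank gate first, then the non-blank detector g_N."""
    return "B" if gate_says_blank else g_n


def mirage_rate_and_bound(samples: List[Sample]) -> Tuple[float, float]:
    """Return (empirical MR(g), empirical P(B) eps_B + P(U) eps_U)."""
    n = len(samples)
    # MR(g) = P(g = R, Y in {B, U}), as a joint empirical frequency.
    mr = sum(1 for y, gb, gn in samples
             if y in ("B", "U") and staged_predict(gb, gn) == "R") / n
    # P(B) * eps_B = P(g_B != B, Y = B)
    blank_term = sum(1 for y, gb, _ in samples if y == "B" and not gb) / n
    # P(U) * eps_U = P(g_N = R, Y = U)
    unrelated_term = sum(1 for y, _, gn in samples if y == "U" and gn == "R") / n
    return mr, blank_term + unrelated_term


def simulate(n: int, seed: int = 0) -> List[Sample]:
    """Synthetic gate outputs with arbitrary, made-up error rates."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        y = rng.choice(["R", "U", "B"])
        gate_says_blank = rng.random() < (0.97 if y == "B" else 0.02)
        g_n = "R" if rng.random() < (0.9 if y == "R" else 0.1) else "U"
        out.append((y, gate_says_blank, g_n))
    return out


mr, bound = mirage_rate_and_bound(simulate(10_000))
print(f"empirical MR = {mr:.4f}, bound = {bound:.4f}")
```

The inequality holds on every sample, not only in expectation, because each mirage event is contained in one of the two gate-failure events being counted.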

Appendix E Negative and Developmental Experiments

In this section, we describe the alternative approaches that we attempted before developing TC-LIA, along with the reasons each fell short as a standalone mirage detector.

E.1 Toy CLIP/BioMedCLIP Relevance Scoring

We first constructed a small image–text relevance dataset containing natural images, medical images, related prompts, and unrelated prompts. Standard OpenCLIP/BioMedCLIP final cosine similarity performed perfectly on the tiny 16-pair setting. However, this was not evidence of a robust solution: the dataset was too small, the negatives were often easy cross-domain mismatches, and final cosine did not test whether internal localized evidence emerged across layers.

Takeaway. Initial CLIP-only experiments were useful as a sanity check but were not sufficient as a main method because they overestimated performance on small/easy negatives and did not explain layer-wise evidence emergence.

E.2 Layer-wise Attention and GradCAM-style Metrics

We next tested several layer-wise attention and GradCAM-style metrics. These included attention concentration, late-layer attention mass, Gini-like attention sparsity, text-conditioned GradCAM maps, and visualizations over related, unrelated, and blank images. These metrics produced interpretable plots, but they were not stable enough as standalone classifiers. Attention maps could be sparse for both correct and incorrect cases, and blank or unrelated cases could still produce visually salient but semantically meaningless hotspots.

Figure 19 shows text-conditioned GradCAM maps computed over the late six ViT-H/14 layers for image–question pairs across three conditions: related, unrelated-real, and blank. Two failure patterns are immediately visible. First, for unrelated inputs the GradCAM map fires on visually salient but semantically irrelevant regions — for example, activating on structural edges of a chest X-ray when the question concerns a rocket launch, or highlighting fur texture on a cat image when the question asks about lung opacity. The model attends to whatever is visually prominent, not to what is question-relevant. Second, for blank images GradCAM produces near-uniform or randomly scattered activations, confirming there is no stable grounding signal to detect.

These observations are confirmed quantitatively. Across nine prompts $\times$ three conditions (27 samples), GradCAM achieved only 22% three-way accuracy, below the 33% expected from uniform guessing over three balanced conditions, and an AUROC of 0.543, barely above chance. The other six attention-only metrics (Raw RAPT, PIA, MHVF, ATS, LER, Gini) all scored below 0.640 AUROC, with most below 0.450. Attention rollout reached an AUROC of 0.750 but still failed to reliably separate unrelated-real from related inputs in per-prompt evaluation. These results indicate that visual attention in a CLIP-style encoder is not text-conditioned at the feature level: the encoder attends to visually salient patches regardless of whether they are semantically matched to the query. This motivated the shift to TC-LIA, which replaces attention weights with direct patch–text cosine similarity computed against the question embedding.

Takeaway. Attention/saliency is useful for qualitative diagnosis, but a scalar mirage detector needs text-conditioned semantic alignment rather than attention concentration alone.

Figure 20 further illustrates why output-overlap metrics alone are insufficient for mirage detection. Even when the visual evidence is absent or mismatched, the model can retain nontrivial lexical overlap with reference answers, suggesting that answer quality does not by itself certify groundedness.

Figure 21 shows that the impact of layer-wise intervention is highly condition-dependent. While some layers exhibit stronger degradation under perturbation, the overall patterns are not cleanly separable enough to define a reliable decision rule for mirage detection.

E.3 RAPT Probe Replication

We also tested a RAPT-style probe inspired by the “seeing but not believing” line of work Liu et al. (2026). The probe used Gemma-3-4B-IT and extracted relative attention per token for image and question spans over a 400-example, five-condition experiment. The conditions were: matched real image/question, unrelated image with real question, blank image with real question, real image with unrelated question, and real image with no question.

Image condition	Question condition	$n$	EM	F1	Image RAPT	Question RAPT
Blank	Real	80	6.2	11.2	0.105	1.211
Real	None	80	0.0	5.1	0.277	0.000
Real	Real	80	27.5	39.2	0.223	0.976
Real	Unrelated	80	0.0	2.0	0.153	1.381
Unrelated	Real	80	0.0	0.8	0.155	1.377

Table 8: RAPT-style diagnostic probe. RAPT reveals modality attention shifts but is not sufficient as a standalone mirage detector. Real/no-question inputs can show high image RAPT, while unrelated image/question pairs can still receive nontrivial image attention.

The key observation is that RAPT is diagnostic but not decisional. It shows whether the decoder allocates relative attention to image or question tokens, but it does not directly verify that the attended image content semantically answers the question. For example, unrelated image–question pairs still receive image attention, and real images without questions can produce high image RAPT even though no VQA answer is meaningful. This motivated TC-LIA’s focus on patch–text semantic alignment rather than aggregate modality attention.

Figure 22 complements the heatmap by summarizing how attention allocation moves under controlled perturbations. The deltas are meaningful as diagnostics, yet they still do not provide a robust criterion for separating related from unrelated-real cases.

E.4 High-resolution Pet Dataset Sanity Check

A larger natural-image sanity check used 100 Oxford-IIIT Pet image pairs with related and unrelated breed prompts. Final cosine and TC-LIA both performed extremely well because the task was comparatively easy. The internal-only probe also achieved high AUROC, confirming that intermediate features contain meaningful semantic information. However, this experiment did not represent the harder medical/document mirage setting, where same-domain and near-domain negatives are more challenging.

E.5 Mixed Medical–Nonmedical 200-pair Experiment

A mixed 200-pair experiment combined natural and medical images with related and unrelated prompts. TC-LIA features, including final cosine, late patch top- $k$ , gain, slope, and internal alignment score, separated related and unrelated pairs well. This experiment motivated the final DiverseVQA-style evaluation but was still too small to support the main paper claim.

E.6 Lessons from Failed Attempts

• Final CLIP cosine is a strong baseline but can be overly optimistic on easy negatives.
• Attention concentration alone does not imply semantic relevance.
• RAPT captures modality allocation but not whether the image evidence answers the question.
• GradCAM/SAM-style maps are valuable for visualization but require a scalar decision layer for deployment.
• The final system needs both semantic alignment and supervised fusion to handle blank, unrelated-real, and related cases simultaneously.

Appendix F Detailed Differentiation from RAPT/VEA

What RAPT and VEA measure.

Relative Attention Per Token (RAPT) Liu et al. (2026) quantifies how much of a decoder’s attention budget is allocated to image tokens versus text tokens at each layer. Visual Evidence Augmentation (VEA) extends this by studying how attention shifts under controlled perturbations. Both methods ask: does the model attend to the image? Our method asks a different question: does the image contain question-relevant visual evidence? This distinction determines whether a detector can separate Unrelated-Real inputs from Related ones.

Figure 25 makes this limitation concrete. The Unrelated-Real curve (red) is nearly indistinguishable from the Matched curve (blue) in both image and question RAPT across all 34 layers of Gemma-3-4B-IT, even though the image bears no semantic relation to the question. The decoder allocates a similar attention budget to image tokens regardless of whether those tokens contain question-relevant content. RAPT therefore cannot separate the two conditions that matter most for mirage detection.

Single-layer knockout cannot suppress visual grounding.

We ran a controlled attention-knockout experiment on Gemma-3-4B-IT using 400 probes spanning 8 VQA domains (80 anchor samples $\times$ 5 conditions, $3\times 50\times 35=5{,}250$ inference passes). At each intervention, image tokens were masked from attending at one target layer $k$ , and RAPT was re-measured at all other layers.

Figure 23 shows the result for five representative knockout layers ( $k\in\{0,8,17,25,33\}$ ) on real inputs. Each knockout produces a sharp local dip only at the intervened layer; all other observed layers return immediately to near-zero deviation from baseline. The network compensates for the blocked layer by redistributing image attention downstream, a forward compensation effect.

Figure 24 generalises this across all 34 layers. The dark-blue diagonal marks local suppression at each knocked-out layer. The red upper-triangle confirms systematic forward compensation: downstream layers increase their image-attention allocation to recover the blocked signal, with early knockouts (rows 0–10) triggering the broadest redistribution because more downstream layers are available. No layer knockout systematically improved ROUGE-L or reduced mirage rate, confirming that visual grounding in Gemma-3 is distributed across the network and cannot be eliminated by single-layer intervention.

How TC-LIA differs.

Where RAPT tracks how much attention flows to image tokens, TC-LIA tracks whether question-conditioned patch evidence emerges in late vision-encoder layers. Specifically, TC-LIA computes top- $k$ patch–text cosine similarity at each of the 32 ViT-H/14 transformer blocks using the frozen text embedding of the question, not the decoder’s attention weights. A related image–question pair exhibits a rising late-layer alignment trajectory; an unrelated pair shows a flat or declining one, regardless of how the decoder allocates its attention budget.

This design sidesteps both failure modes of RAPT. First, it operates on the encoder before any decoder attention is computed, so there is no mechanism for downstream layers to compensate for a blocked signal. Second, it conditions alignment on the specific question text rather than measuring aggregate modality attention, making it sensitive to semantic mismatch rather than just image presence. TC-LIA is therefore decisional rather than merely diagnostic: its scalar features feed directly into an ensemble classifier.
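The trajectory features described above can be sketched in a few lines. The code below is a simplified illustration, not the implementation used in our experiments: it assumes the per-layer patch tokens have already been projected into the joint CLIP embedding space, represents them as plain Python lists, and uses placeholder values for $k$ and for the number of layers treated as early or late (`n_edge`). The composite internal alignment score is not reproduced here.

```python
import math
from statistics import mean
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def topk_alignment(patches: Sequence[Vector], text: Vector, k: int) -> float:
    """Mean of the k largest patch-text cosine similarities in one layer."""
    sims = sorted((cosine(p, text) for p in patches), reverse=True)
    return mean(sims[:k])


def ls_slope(ys: Sequence[float]) -> float:
    """Least-squares slope of ys against layer index (needs >= 2 layers)."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def trajectory_features(layers: Sequence[Sequence[Vector]], text: Vector,
                        k: int = 2, n_edge: int = 2) -> Dict[str, float]:
    """Late top-k mean, early-to-late gain, and slope of the alignment curve."""
    traj = [topk_alignment(patches, text, k) for patches in layers]
    early, late = mean(traj[:n_edge]), mean(traj[-n_edge:])
    return {"late_topk": late, "gain": late - early, "slope": ls_slope(traj)}


# Toy 2-D embeddings over 4 layers (illustrative only; real tokens are high-dimensional).
question = [1.0, 0.0]
rising = [[[a, 1.0], [a, 1.0], [0.0, 1.0]] for a in (0.0, 0.5, 1.0, 2.0)]
flat = [[[0.2, 1.0], [0.2, 1.0], [0.2, 1.0]] for _ in range(4)]

rising_feats = trajectory_features(rising, question)
flat_feats = trajectory_features(flat, question)
```

In this toy setting the rising trajectory yields a positive gain and slope, while the flat trajectory yields values near zero, mirroring the qualitative contrast between related and unrelated pairs.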

Aspect	RAPT/VEA-style work	Our TC-LIA mirage detector
Primary question	Does the VLM attend to visual evidence and use it correctly?	Is there question-relevant visual evidence present before generation?
Input assumption	Usually assumes the image is relevant and evidence exists.	Explicitly includes related, unrelated-real, and blank/noise inputs.
Signal	Decoder-side attention mass over text/image/evidence tokens.	Vision-encoder patch embeddings aligned with text across layers.
Main operation	Highlight or mask evidence regions to improve answer generation.	Classify image–question pair and decide answer vs abstain.
Failure addressed	“Seeing but not believing”: evidence is present but underutilized.	Mirage risk: answer is generated despite absent or mismatched evidence.
Primary metric	EM/F1 improvement and evidence attribution quality.	Mirage rate, three-class accuracy, macro-F1, related recall.
Use in our paper	Related work and auxiliary diagnostic.	Main method.

Table 9: Conceptual distinction between RAPT/VEA-style evidence utilization and TC-LIA mirage detection.

Appendix G SAM3 Grounding as a Diagnostic Rather than a Detector

As a qualitative sanity check, we also visualize prompt-conditioned SAM3-style grounding maps. For related examples, the segmentation/grounding mask should cover plausible question-relevant image regions. For unrelated-real examples, the mask is expected to be absent, diffuse, low-confidence, or semantically inconsistent with the prompt. For blank/noise examples, no stable grounding should be produced.

Figure 26 shows representative SAM3-style grounding maps across all three input conditions.

Although SAM3-style grounding provides useful qualitative evidence, we found that it is not sufficiently reliable as a standalone mirage detector. For some Related examples, the generated masks missed clinically or semantically salient regions, especially when the relevant evidence was subtle, diffuse, or not easily described by a short phrase. Conversely, for Unrelated-Real or even Blank/Noise examples, the grounding model occasionally produced spurious masks on visually salient but question-irrelevant regions, indicating that mask presence alone does not guarantee question-relevant evidence. These failure modes suggest that grounding visualizations are valuable for interpretation and error analysis, but they should not be treated as a decisive pre-generation answerability test. We therefore use SAM3-style masks as auxiliary diagnostic evidence, while the main mirage detector relies on TC-LIA alignment features and ensemble fusion.

Appendix H Qualitative Result Cards

Figures 27 and 28 show result cards for Gemma-3-4B-IT and InternVL2-8B, respectively, each covering the Related, Unrelated-Real, and Blank/Noise conditions on the same question. Additional result cards for Qwen2.5-VL-32B, Aya-Vision-32B, BLIP2-OPT-2.7B, InternVL3-38B, and LLaVA-1.5-7B are shown in Figures 30–34.

Classifier	Accuracy (%) $\uparrow$	Mirage Rate (%) $\downarrow$	Macro-F1 $\uparrow$	CV Acc
RandomForest	94.7	3.0	0.947	0.902
XGBoost	94.5	3.1	0.945	0.902
GradientBoosting	94.5	3.1	0.945	0.904
AdaBoost	94.3	3.2	0.943	0.900
LightGBM	94.1	3.2	0.941	0.913

Table 10: Ensemble classifier comparison for a representative backbone (Qwen2.5-VL-32B).

Appendix I Classifier-Level Diagnostics

Figures 35–37 provide additional diagnostics for the ensemble classifiers. Figure 35 shows normalized confusion matrices for TC-LIA only and the best ensemble for each VLM backbone. Across models, blank/noise examples are almost perfectly separated, confirming that low-level invalid inputs are not the main error source. The remaining mistakes are concentrated between Related and Unrelated-Real, indicating that the hardest failure mode is semantic mismatch rather than blank-image detection.

Figure 36 compares the held-out accuracy and mirage rate of the five ensemble classifiers. The classifiers show broadly similar mirage suppression, but tree-based ensembles differ in the accuracy–mirage trade-off. This supports reporting the selected best ensemble per VLM rather than relying on a single classifier family in all settings. Figure 37 reports 5-fold cross-validation accuracy for each classifier and VLM backbone, showing that the selected classifiers are not chosen from a single unstable split.

Appendix J Full TC-LIA Score Comparison

Score	AUROC
Internal alignment score	0.963
Final cosine, ViT-H-14 only	0.963
Slope patch top- $k$	0.934
Gain patch top- $k$	0.930
Late patch top- $k$ mean	0.909

Table 11: TC-LIA component and baseline score comparison.

Figure 38 shows that IAS consistently matches or outperforms final_cos across all domains, with the largest gains in document and infographic settings where global cosine similarity is weakest.

Appendix K Per-Condition Answer Quality

Figure 15 reports the full per-condition answer-quality breakdown for Related, Unrelated-Real, and Blank/Noise inputs. For Related examples, the base prompt sometimes achieves higher lexical or semantic overlap with the ground-truth answers, especially for models that generate longer free-form responses. This is expected because the ensemble is optimized primarily for safe answer release rather than answer rewriting. Importantly, however, the ensemble preserves non-trivial answer quality on related inputs while dramatically improving behavior on non-answerable inputs.

For Unrelated-Real and Blank/Noise examples, the pattern is much clearer: the ensemble obtains consistently high BLEU, ROUGE-L, and BERTScore F1 against reference refusal phrases, whereas the base prompt remains substantially lower and more variable. This indicates that many base VLMs continue to produce content-bearing answers even when the image is unrelated or non-informative, while the proposed detector reliably converts such cases into refusal outputs. The blank/noise row shows the strongest improvement, with the ensemble approaching near-perfect refusal behavior for most VLM backbones. Overall, the quality analysis supports the intended operating point of the system: preserve useful answers for related image–question pairs while enforcing consistent refusal for mirage-prone inputs.

Appendix L Calibration and Operating Points

For deployment, the detector threshold can be selected according to an acceptable mirage-rate budget rather than only maximizing overall accuracy. Table 12 shows three operating points. A strict 1.0% mirage-rate target yields the safest behavior, but it reduces related recall to 79.3%, meaning that more answerable cases are conservatively refused. Relaxing the budget to 2.5% improves accuracy to 94.2% and related recall to 90.4%, giving the best macro-F1 among the three settings. At a 5.0% mirage-rate budget, related recall increases further to 96.6%, but accuracy and macro-F1 slightly decrease. These results illustrate the expected safety–coverage trade-off: lower mirage budgets provide stronger protection against unsupported answers, while higher budgets preserve more responses for genuinely related inputs.
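Budget-based threshold selection can be sketched as a sweep over candidate thresholds. The code below is a minimal illustration under stated assumptions: the detector is assumed to emit a scalar relatedness score per input, an input is released when its score is at or above the threshold, mirage rate is computed as the joint frequency of non-related inputs being released, and the toy scores and labels are made up. Function names are hypothetical.

```python
from typing import List, Optional, Tuple


def mirage_rate(scores: List[float], labels: List[str], thr: float) -> float:
    """Fraction of all inputs that are non-related yet released (score >= thr)."""
    released_bad = sum(1 for s, y in zip(scores, labels) if y != "R" and s >= thr)
    return released_bad / len(scores)


def related_recall(scores: List[float], labels: List[str], thr: float) -> float:
    """Fraction of related inputs that are released (score >= thr)."""
    related = [s for s, y in zip(scores, labels) if y == "R"]
    return sum(1 for s in related if s >= thr) / len(related)


def pick_threshold(scores: List[float], labels: List[str],
                   budget: float) -> Optional[Tuple[float, float, float]]:
    """Lowest observed-score threshold whose mirage rate is within the budget.

    Returns (threshold, mirage_rate, related_recall), or None if none qualifies.
    """
    for thr in sorted(set(scores)):
        mr = mirage_rate(scores, labels, thr)
        if mr <= budget:
            return thr, mr, related_recall(scores, labels, thr)
    return None


# Made-up scores and labels (R = related, U = unrelated-real, B = blank/noise).
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = ["R", "R", "R", "U", "R", "R", "U", "B", "U", "B"]

relaxed = pick_threshold(scores, labels, budget=0.10)
strict = pick_threshold(scores, labels, budget=0.00)
```

Because the mirage rate cannot increase as the threshold rises, the lowest qualifying threshold also gives the highest related recall within the budget. On this toy data the stricter budget forces a higher threshold and lower related recall, the same direction of trade-off discussed above.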

Appendix M Per-Domain and Structured-Prompt Diagnostics

Figure 40 reports per-domain three-class accuracy for the best ensemble associated with each VLM backbone. The results show that performance is not uniform across datasets: some domains are consistently easier, while medical and pathology-style domains can be more variable due to domain-specific visual structure and question specificity. Nevertheless, the ensemble maintains strong accuracy across all five domains, supporting the claim that TC-LIA features transfer beyond a single dataset.

Figure 41 compares the accuracy of the VLM’s structured self-assessment alone against the full ensemble. Points above the diagonal indicate cases where TC-LIA, domain-routing, and pixel-statistic features improve over the VLM class prediction alone. The consistent gap demonstrates that the proposed detector is not merely relying on VLM self-refusal or self-classification; instead, the ensemble gains additional discriminative signal from layer-wise image–text alignment.

Appendix N Use of AI Assistants

AI assistance was used only for non-substantive writing support, including grammar correction, wording refinement, condensation of lengthy sections, and organization of appendix material. The research idea, experimental design, implementation, data analysis, results interpretation, figures, and final scientific claims were developed and verified by the authors. All AI-assisted edits were reviewed and revised by the authors, who take full responsibility for the content of the paper.

Target MR	Achieved MR	Accuracy (%)	Related Recall (%)	Macro-F1
1.0%	1.0%	92.0	79.3	0.919
2.5%	2.5%	94.2	90.4	0.942
5.0%	5.0%	93.8	96.6	0.937

Table 12: Operating points selected by mirage-rate budget.