pith. machine review for the scientific record.

arxiv: 2604.25235 · v2 · submitted 2026-04-28 · 💻 cs.LG · cs.CL · cs.CV · stat.ML

Recognition: unknown

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Amit Ranjan Trivedi, Devashri Naik, Divake Kumar, Ranganath Krishnan, Sina Tayebati

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV · stat.ML
keywords VLM judges · conformal prediction · multimodal evaluation · prediction intervals · ranking-scoring decoupling · task-dependent uncertainty · vision-language models · score reliability

The pith

VLM judges can rank responses correctly but fail to assign reliable absolute scores due to strongly task-dependent uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines vision-language models acting as judges for multimodal tasks and finds that their point scores lack any built-in reliability signal. Using conformal prediction on score-token log-probabilities, it constructs prediction intervals that reveal wide variation in uncertainty across 14 visual task categories. Judges maintain high ranking correlation even when intervals span most of the score range, exposing a ranking-scoring decoupling that standard metrics miss. The work produces a quantitative reliability map showing much tighter intervals on clean captioning benchmarks than on chart or math reasoning tasks.

Core claim

VLM judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Evaluation uncertainty is strongly task-dependent: intervals cover roughly 40 percent of the score range for aesthetics and natural images but expand to about 70 percent for chart and mathematical reasoning tasks. Interval width is driven primarily by task difficulty and annotation quality, with the same judge and method yielding 4.5 times narrower intervals on a clean, multi-annotator captioning benchmark.

What carries the argument

Conformal prediction intervals built directly from the judge's score-token log-probabilities, converting a point score into a calibrated interval without retraining or additional data assumptions.
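The interval machinery described above can be sketched in a few lines. This is a generic split-conformal construction with an absolute-residual nonconformity score, not the paper's R2CCP method, and the score-token setup (a judge emitting tokens "1".."k") is an illustrative assumption:

```python
import numpy as np

def expected_score(logprobs):
    """Convert log-probabilities over score tokens (assumed "1".."k")
    into a probability-weighted expected score per response.

    logprobs: array of shape (n, k), log-prob of each score token.
    """
    probs = np.exp(logprobs - logprobs.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    scores = np.arange(1, logprobs.shape[1] + 1)  # score values 1..k
    return probs @ scores

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction with absolute-residual nonconformity.

    cal_pred / cal_true: judge scores and reference (human) scores on a
    held-out calibration split; test_pred: scores to wrap in intervals.
    Returns (lower, upper) arrays targeting ~(1 - alpha) coverage.
    """
    resid = np.abs(cal_pred - cal_true)          # nonconformity scores
    n = len(resid)
    # finite-sample-corrected quantile of the calibration residuals
    q = np.quantile(resid, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return test_pred - q, test_pred + q
```

Note that the interval half-width `q` is a single global quantity here; the task-dependent widths reported in the paper come from calibrating per task or from a conditional method such as R2CCP.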

Load-bearing premise

The conformal prediction intervals constructed from score-token log-probabilities are valid and meaningful for VLM outputs without extra calibration or assumptions on the underlying data distribution.

What would settle it

A finding that the constructed intervals stay narrow, cover the true scores at the claimed rate, and show little variation across all 14 task categories would falsify the task-dependent uncertainty and ranking-scoring decoupling claims.

Figures

Figures reproduced from arXiv: 2604.25235 by Amit Ranjan Trivedi, Devashri Naik, Divake Kumar, Ranganath Krishnan, Sina Tayebati.

Figure 1. R2CCP interval width across 14 task categories for all three judges. Task-dependent …
Figure 2. MLLM-Judge vs. Polaris: same judge (LLaVA-Critic-7B), same CP method (R2CCP, …
Figure 3. Ranking-scoring decoupling across datasets. The horizontal axis is Pearson corre…
Original abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper applies conformal prediction to convert point scores from VLM judges into calibrated prediction intervals using only score-token log-probabilities, without retraining. Across 3 VLMs and 14 visual task categories, it reports that interval widths are strongly task-dependent (narrower for aesthetics/natural images, wider for charts/math reasoning), identifies a ranking-scoring decoupling where high rank correlation coexists with uninformative absolute-score intervals, and shows that interval width is driven by task difficulty and annotation quality (e.g., 4.5x narrower on clean multi-annotator data). Code is released.

Significance. If the intervals are valid, the work supplies a practical, distribution-free tool for quantifying VLM-judge reliability and produces a task-specific reliability map that could inform evaluation pipelines in multimodal systems. The ranking-scoring decoupling observation and the link to annotation quality are useful distinctions not captured by standard metrics. Releasing reproducible code is a clear strength.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90 %). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.
  2. [§5.1] §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation nor controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.
minor comments (3)
  1. [Abstract and §2] Abstract and §2: The phrase “first systematic analysis” should be qualified by explicit comparison to prior conformal-prediction applications in NLP or vision-language evaluation.
  2. [Figure 2 and Table 1] Figure 2 and Table 1: Axis labels and legends should explicitly state the conformal significance level α used for all reported intervals.
  3. [§3.2] Notation: The mapping from raw score-token log-probabilities to nonconformity scores is described only in prose; a short equation or pseudocode block would improve clarity.
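The coverage diagnostic requested in major comment 1 reduces to a simple computation. The helpers below are hypothetical (not from the paper's code), sketching how per-task empirical coverage against a nominal 1 − α level could be tabulated:

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of held-out true scores falling inside their intervals.

    For calibrated conformal intervals at level alpha, this should sit
    close to (and not systematically below) 1 - alpha.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def coverage_by_task(y_true, lower, upper, task_ids):
    """Per-task coverage table, one row per task category."""
    y_true, lower, upper, task_ids = map(
        np.asarray, (y_true, lower, upper, task_ids))
    return {t: empirical_coverage(y_true[task_ids == t],
                                  lower[task_ids == t],
                                  upper[task_ids == t])
            for t in np.unique(task_ids)}
```

With such a table in hand, width comparisons across the 14 categories become interpretable: a wide interval only signals high uncertainty if the intervals also achieve their nominal coverage.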

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and outline revisions that will strengthen the empirical support for our claims while preserving the core contributions of the work.

Point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90 %). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.

    Authors: We agree that explicit empirical coverage diagnostics would improve interpretability. Although the conformal prediction procedure we employ carries a theoretical marginal coverage guarantee under exchangeability, reporting achieved coverage on held-out data is a standard and useful practice. In the revised manuscript we will add a table in §4 (or a dedicated subsection) that reports empirical coverage rates for the nominal 90 % level across all three VLMs and the 14 task categories. This addition will directly confirm calibration and support the subsequent width comparisons and reliability map. revision: yes

  2. Referee: [§5.1] §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation nor controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.

    Authors: The referee correctly notes the absence of formal statistical controls. To address this, we will augment §5.1 with (i) a test of the null hypothesis that the observed dissociation between rank correlation and interval width is explained solely by interval width (e.g., via partial correlation or a regression that includes task-level score variance as a covariate) and (ii) stratification or regression-based controls for task-specific variance. These additions will clarify that the decoupling is not an artifact of wide intervals alone and will strengthen the presentation of it as a distinct failure mode. revision: yes
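The partial-correlation control proposed in response 2 can be sketched minimally. Variable roles (per-task rank correlation, interval width, score variance) are illustrative assumptions, and the paper's revision may use a different estimator:

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y after regressing out covariate z.

    E.g. x = per-task rank correlation, y = per-task interval width,
    z = per-task score variance: a nonzero partial correlation would
    indicate a dissociation not explained by the covariate alone.
    """
    x, y, z = map(np.asarray, (x, y, z))
    Z = np.column_stack([np.ones_like(z, dtype=float), z])
    # least-squares residuals of x and y on the design [1, z]
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])
```

When x and y are both driven only by z, their raw correlation is high but the partial correlation collapses toward zero, which is exactly the artifact the control is meant to rule out.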

Circularity Check

0 steps flagged

No significant circularity; standard conformal framework applied empirically

full rationale

The paper applies the known distribution-free conformal prediction procedure to convert VLM point scores into intervals using score-token log-probabilities as the nonconformity measure. All reported findings (task-dependent interval widths, ranking-scoring decoupling, and the 4.5x width difference on clean benchmarks) are empirical observations obtained by running the fixed method across 14 task categories and 3 judges. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain remains self-contained against the external conformal prediction literature and the experimental data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on the standard conformal prediction axioms and one typical hyperparameter; no new entities postulated.

free parameters (1)
  • conformal significance level (alpha)
    Standard parameter in conformal prediction to control coverage probability; implied but not specified in the abstract.
axioms (1)
  • standard math Conformal prediction provides distribution-free valid prediction intervals under the exchangeability assumption.
    Core assumption of the conformal prediction framework used.
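In standard split-conformal notation (not drawn from the paper's own equations), the exchangeability axiom yields the usual marginal guarantee:

```latex
% Split conformal prediction: if the calibration pairs
% (X_1, Y_1), ..., (X_n, Y_n) and the test pair (X_{n+1}, Y_{n+1})
% are exchangeable, the interval C_alpha built from the calibration
% nonconformity scores satisfies, marginally over the data,
\[
  \mathbb{P}\left( Y_{n+1} \in C_\alpha(X_{n+1}) \right) \ge 1 - \alpha .
\]
```

The guarantee is marginal, not per-task, which is why the referee's request for per-category empirical coverage is substantive rather than redundant.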

pith-pipeline@v0.9.0 · 5542 in / 1253 out tokens · 48358 ms · 2026-05-07T16:41:55.981858+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Reference graph

Works this paper leans on

42 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Phi-4-reasoning technical report, 2025

    Marah Abdin, Sahil Agarwal, Aman Agrawal, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025

  2. [2]

    Learn then test: Calibrating predictive algorithms to achieve risk control

    Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021

  3. [3]

    Navigating the unknown: Uncertainty-aware compute-in-memory autonomy of edge robotics

    Nastaran Darabi, Priyesh Shukla, Dinithi Jayasuriya, Divake Kumar, Alex Christopher Stutts, and Amit Ranjan Trivedi. Navigating the unknown: Uncertainty-aware compute-in-memory autonomy of edge robotics. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6, 2024

  4. [4]

    MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In International Conference on Machine Learning (ICML), 2024

  5. [5]

    Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation

    Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, and Amit Ranjan Trivedi. Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation. arXiv preprint arXiv:2511.12389, 2025a

  6. [6]

    Gemini 2.5: Our newest Gemini model with thinking

    Google DeepMind. Gemini 2.5: Our newest Gemini model with thinking. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025

  7. [7]

    Learning conformal abstention policies for adaptive risk management in large language and vision-language models

    Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi. Learning conformal abstention policies for adaptive risk management in large language and vision-language models. arXiv preprint arXiv:2502.06884, 2025a

  8. [8]

    Conformal prediction via regression-as-classification

    Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan, and Eugene Ndiaye. Conformal prediction via regression-as-classification. In International Conference on Learning Representations (ICLR), 2024

  9. [9]

    SocREval: Large language models with the socratic method for reference-free reasoning evaluation

    Hangfeng He, Hongming Zhang, and Dan Roth. SocREval: Large language models with the socratic method for reference-free reasoning evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024

  10. [10]

    Conformal inference meets evidential learning: Distribution-free uncertainty quantification with epistemic and aleatoric separability

    Alex Christopher Stutts, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Conformal inference meets evidential learning: Distribution-free uncertainty quantification with epistemic and aleatoric separability. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC), pp. 1–4, 2024

  11. [11]

    Conformal prediction with large language models for multi-choice question answering

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023

  12. [12]

    VL-RewardBench: A challenging benchmark for vision-language generative reward models

    Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, and Qi Liu. VL-RewardBench: A challenging benchmark for vision-language generative reward models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  13. [13]

    INTACT: Inducing noise tolerance through adversarial curriculum training for LiDAR-based safety-critical perception and autonomy

    Nastaran Darabi, Divake Kumar, Sina Tayebati, and Amit Ranjan Trivedi. INTACT: Inducing noise tolerance through adversarial curriculum training for LiDAR-based safety-critical perception and autonomy. arXiv preprint arXiv:2502.01896, 2025

  14. [14]

    Locally valid and discriminative prediction intervals for deep learning models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Locally valid and discriminative prediction intervals for deep learning models. In Advances in Neural Information Processing Systems, 2021

  15. [15]

    Uncertainty-aware LiDAR-camera autonomy via conformal prediction and principled abstention

    Divake Kumar, Sina Tayebati, Nastaran Darabi, Vita Pi-Ho Hu, and Amit Ranjan Trivedi. Uncertainty-aware LiDAR-camera autonomy via conformal prediction and principled abstention. In 2025 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pp. 1–6. IEEE, 2025b. doi:10.1109/COINS65080.2025.11125785

  16. [16]

    G-Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2511–2522, 2023

  17. [17]

    Fair conformal predictors for applications in medical imaging

    Charles Lu, Andréanne Lemay, Ken Chang, Katharina Höbel, and Jayashree Kalpathy-Cramer. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008–12016, 2022

  18. [18]

    Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

    Devashri Naik, Divake Kumar, Nastaran Darabi, and Amit Ranjan Trivedi. Belief dynamics for detecting behavioral shifts in safe collaborative manipulation. arXiv preprint arXiv:2604.04967, 2026

  19. [19]

    Language models with conformal factuality guarantees

    Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In International Conference on Machine Learning (ICML), 2024

  20. [20]

    Uncertainty-guided inference-time depth adaptation for transformer-based visual tracking

    Patrick Poggi, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Uncertainty-guided inference-time depth adaptation for transformer-based visual tracking. arXiv preprint arXiv:2602.16160, 2026

  21. [21]

    Conformal language modeling

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. Conformal language modeling. In International Conference on Learning Representations (ICLR), 2024

  22. [22]

    CAP: Conformalized abstention policies for context-adaptive risk management for LLMs and VLMs

    Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Theja Tulabandhula, Ranganath Krishnan, and Amit Ranjan Trivedi. CAP: Conformalized abstention policies for context-adaptive risk management for LLMs and VLMs. In Proceedings of the 17th Asian Conference on Machine Learning (ACML), Conference Track, 2025b

  23. [23]

    Conformalized quantile regression

    Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, 2019

  24. [24]

    Learnable conformal prediction with context-aware nonconformity functions for robotic planning and perception

    Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, and Amit Ranjan Trivedi. Learnable conformal prediction with context-aware nonconformity functions for robotic planning and perception. arXiv preprint arXiv:2509.21955, 2025c

  25. [25]

    A comparison of some conformal quantile regression methods

    Matteo Sesia and Emmanuel J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1): e261, 2020

  26. [26]

    Conformal prediction using conditional histograms

    Matteo Sesia and Yaniv Romano. Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems, 2021

  27. [27]

    Intelligent sensing-to-action for robust autonomy at the edge: Opportunities and challenges

    Amit Ranjan Trivedi, Sina Tayebati, Hemant Kumawat, Nastaran Darabi, Divake Kumar, Adarsh Kumar Kosta, Yeshwanth Venkatesha, Dinithi Jayasuriya, Nethmi Jayasinghe, Priyadarshini Panda, Saibal Mukhopadhyay, and Kaushik Roy. Intelligent sensing-to-action for robust autonomy at the edge: Opportunities and challenges. In 2025 Design, Automation & Test in Euro...

  28. [28]

    Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction

    Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 11286–11328, 2025

  29. [29]

    API is enough: Conformal prediction for large language models without logit-access

    Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. API is enough: Conformal prediction for large language models without logit-access. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

  30. [30]

    TRACER: Trajectory risk aggregation for critical episodes in agentic reasoning

    Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, and Amit Ranjan Trivedi. TRACER: Trajectory risk aggregation for critical episodes in agentic reasoning. arXiv preprint arXiv:2602.11409, 2026

  31. [31]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  32. [32]

    Algorithmic Learning in a Random World

    Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

  33. [33]

    TRIAGE: Type-routed interventions via aleatoric-epistemic gated estimation in robotic manipulation and adaptive perception: don't treat all uncertainty the same

    Divake Kumar, Sina Tayebati, Devashri Naik, Patrick Poggi, Amanda Sofie Rios, Nilesh Ahuja, and Amit Ranjan Trivedi. TRIAGE: Type-routed interventions via aleatoric-epistemic gated estimation in robotic manipulation and adaptive perception: don't treat all uncertainty the same. arXiv preprint arXiv:2603.08128, 2026

  34. [34]

    Polos: Multimodal metric learning from human feedback for image captioning

    Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sugiura. Polos: Multimodal metric learning from human feedback for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  35. [35]

    Black-box uncertainty quantification method for LLM-as-a-judge

    Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M Daly, Qian Pan, Martin Santillan Cooper, James M Johnson, and Werner Geyer. Black-box uncertainty quantification method for LLM-as-a-judge. arXiv preprint arXiv:2410.11594, 2024

  36. [36]

    Boosted conformal prediction intervals

    Ran Xie, Rina Foygel Barber, and Emmanuel J. Candès. Boosted conformal prediction intervals. In Advances in Neural Information Processing Systems, 2024

  37. [37]

    Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In International Conference on Learning Representations (ICLR), 2024

  38. [38]

    LLaVA-Critic: Learning to evaluate multimodal models

    Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaVA-Critic: Learning to evaluate multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  39. [39]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, 2023
