arxiv: 2604.04532 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

Alhasan Mahmood, Hasan Kurban, Samir Abdaljalil

Pith reviewed 2026-05-10 19:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multilingual evaluationagent-as-a-judgeprompt localizationlanguage sensitivityAI backbone comparisonrequirement-level judgmentcross-language consistency

0 comments

The pith

Changing the judge's language can reverse which AI backbone performs best on agent development tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the language chosen for evaluating AI agents on coding tasks interacts strongly with the underlying model, often flipping the ranking of which backbone gives the highest satisfaction scores. By translating both task requirements and judge instructions into five languages and running thousands of requirement-level judgments, it shows that English defaults produce misleading comparisons. This interaction appears decisive: partial localization of instructions alone causes sharp drops in scores for some languages. A sympathetic reader would care because agent benchmarks that ignore language risk selecting frameworks that only look strong under English conditions.

Core claim

Localizing the Agent-as-a-Judge prompt stack across English, Arabic, Turkish, Chinese, and Hindi on 55 DevAI tasks shows that backbone and language interact: GPT-4o reaches the highest satisfaction in English at 44.72 percent while Gemini leads in Arabic at 51.72 percent and Hindi at 53.22 percent, with no backbone dominating all languages and inter-backbone agreement remaining modest at Fleiss kappa of at most 0.231. An ablation confirms that localizing judge-side instructions, not merely the benchmark content, drives the difference, as Hindi satisfaction falls from 42.8 percent to 23.2 percent under partial localization.

What carries the argument

The backbone-language interaction in requirement-level satisfaction judgments produced by fully localized judge prompts and translated task requirements.

Load-bearing premise

That translating requirements and judge instructions into other languages keeps their intended meaning and difficulty unchanged without introducing biases in how different models interpret the same requirements.

What would settle it

Re-running the full set of 4950 judgments after independent native-speaker re-translations of the prompts and checking whether the observed ranking inversions between backbones persist or disappear.

Figures

Figures reproduced from arXiv: 2604.04532 by Alhasan Mahmood, Hasan Kurban, Samir Abdaljalil.

**Figure 2.** Figure 2: Overview of the multilingual Agent-as-a-Judge pipeline. DevAI tasks are executed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Percentile success rates (%) by language and backbone, averaged over three [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Requirement-type sensitivity (%) across languages and backbones. Each of the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Cross-backbone heatmaps summarizing central tendency and variability, averaged [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Cross-backbone performance and cost with standard-deviation error bars (825 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Requirement satisfaction (%) by language and framework for the GPT-4o baseline, [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Task Solve Rate (%) by language and framework for the GPT-4o baseline, shown [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Cross-language satisfaction profiles (left) and cross-language variability (right) [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Input tokens vs. satisfaction (left) and runtime vs. satisfaction (right) for the [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Task-level requirement satisfaction distributions by language and framework [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Runtime and token usage panels by language and framework for the GPT-4o [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

read the original abstract

Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72\%), while Gemini leads in Arabic (51.72\%, $p<0.001$ vs.\ GPT-4o) and Hindi (53.22\%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss' $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8\% to 23.2\% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Judge language flips backbone rankings in agent evaluation, but without translation checks the differences could be artifacts.

read the letter

The main thing to know is that localizing the judge prompt to Arabic, Hindi, Turkish or Chinese can reverse which model ranks highest on requirement-level agent tasks. GPT-4o leads in English at 44.72% satisfaction while Gemini leads in Hindi at 53.22% and Arabic at 51.72%, with no backbone winning across all languages. They ran this across 55 DevAI tasks, three frameworks, six backbones and 4950 total judgments, plus a clean ablation showing partial localization drops Hindi scores from 42.8% to 23.2% and they release the raw judgments.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study localizing the Agent-as-a-Judge prompt stack to five languages (English, Arabic, Turkish, Chinese, Hindi) and evaluating it on 55 DevAI tasks using three agent frameworks and six backbones for a total of 4950 runs. It claims that judge backbone and language interact strongly, with performance rankings inverting across languages (GPT-4o highest in English at 44.72%, Gemini highest in Arabic at 51.72% and Hindi at 53.22%, both p<0.001 vs GPT-4o), no backbone dominates universally, inter-backbone agreement is modest (Fleiss' κ ≤ 0.231), and that localizing judge instructions (not just content) is decisive as shown by a Hindi ablation (42.8% full vs 23.2% partial localization). The authors conclude that language should be treated as an explicit variable in agentic benchmarks and release the judgments for reproducibility.

Significance. If the results are robust to translation artifacts, this work is significant in demonstrating that standard English-centric evaluations may not generalize, providing evidence from a large-scale, controlled experiment with ablation and public data release. It contributes to the growing literature on multilingual LLM evaluation by focusing on agentic, requirement-level judging rather than simple QA.

major comments (2)

[Methods (localization procedure)] The assumption that localized requirements and prompts preserve equivalent meaning and difficulty across languages is central to interpreting the ranking inversions as backbone-language interactions. The manuscript reports no quantitative validation such as back-translation BLEU scores, bilingual human ratings of semantic equivalence, or difficulty calibration for the 55 tasks in Arabic, Turkish, Chinese, and Hindi. This leaves open the possibility that observed differences (e.g., Gemini 53.22% in Hindi) arise from translation quality variations or language-specific requirement biases.
[Results (ablation study)] While the ablation in Hindi shows a large drop from 42.8% to 23.2% satisfaction under partial localization, the paper does not specify what 'partial' entails in detail or provide controls to confirm that the partial version alters only the intended localization aspect without introducing other confounds.

minor comments (2)

[Abstract] The list of the six judge backbones and three developer-agent frameworks is not provided in the abstract; including them would improve readability.
[Abstract] The exact statistical test and correction method for the reported p<0.001 values should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We have carefully considered the major comments and provide point-by-point responses below, along with planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Methods (localization procedure)] The assumption that localized requirements and prompts preserve equivalent meaning and difficulty across languages is central to interpreting the ranking inversions as backbone-language interactions. The manuscript reports no quantitative validation such as back-translation BLEU scores, bilingual human ratings of semantic equivalence, or difficulty calibration for the 55 tasks in Arabic, Turkish, Chinese, and Hindi. This leaves open the possibility that observed differences (e.g., Gemini 53.22% in Hindi) arise from translation quality variations or language-specific requirement biases.

Authors: We agree that the absence of quantitative validation (such as back-translation metrics or human equivalence ratings) is a limitation in the current manuscript. The ablation study provides some supporting evidence by demonstrating a substantial performance difference when only judge instructions are localized versus full localization, using the same underlying requirements. We will revise the Methods section to expand on the localization procedure as implemented, add an explicit discussion of this limitation, and release all original and localized prompts to facilitate independent verification by the community. revision: partial
Referee: [Results (ablation study)] While the ablation in Hindi shows a large drop from 42.8% to 23.2% satisfaction under partial localization, the paper does not specify what 'partial' entails in detail or provide controls to confirm that the partial version alters only the intended localization aspect without introducing other confounds.

Authors: We agree that greater specificity is required. We will update the manuscript to explicitly define the partial localization condition (translating only the task requirements while retaining English judge instructions and criteria) and include the exact prompt templates for both full and partial conditions. We will also add a comparison to the fully English baseline to help isolate the intended effect and address potential confounds. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with direct statistical reporting

full rationale

The paper conducts controlled experiments localizing prompts and requirements across languages, then reports satisfaction percentages, p-values, and Fleiss' kappa computed from 4950 judge runs on 55 tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims exist. All results are direct counts and comparisons from the runs, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical benchmark study. No free parameters are fitted beyond standard statistical reporting. No new entities are postulated.

axioms (1)

standard math Statistical comparisons (p-values, Fleiss' kappa) are appropriately applied to the satisfaction scores and agreement metrics.
Invoked when reporting Gemini outperforming GPT-4o and low inter-backbone agreement.

pith-pipeline@v0.9.0 · 5523 in / 1357 out tokens · 70598 ms · 2026-05-10T19:58:18.058368+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We localize the AAAJ prompt stack to five typologically diverse languages... backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%) and Hindi (53.22%).

Reference graph

Works this paper leans on

32 extracted references · 17 canonical work pages · 4 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Feldman, et al

Federico Cassano, Jacob Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, et al. MultiPL-E : A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49 0 (7): 0 3675--3691, 2023

2023
[3]

McEval : Massively multilingual code evaluation

Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. McEval : Massively multilingual code evaluation. In International Conference on Learning Representations, 2025

2025
[4]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal A. Patwardhan, Lilian Weng, et al. MLE-Bench : Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

work page Pith review arXiv 2024
[5]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, and Mitesh M. Khapra. Cross-lingual auto evaluation for assessing multilingual LLM s. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 29297--29329, Vienna, Austria, 2025. Association fo...

work page doi:10.18653/v1/2025.acl-long.1419 2025
[7]

MASSIVE : A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. MASSIVE : A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages....

work page doi:10.18653/v1/2023.acl-long.235 2023
[8]

How reliable is multilingual LLM -as-a-judge? In Findings of the Association for Computational Linguistics: EMNLP 2025 , pp.\ 11040--11053, Suzhou, China, 2025

Xiyan Fu and Wei Liu. How reliable is multilingual LLM -as-a-judge? In Findings of the Association for Computational Linguistics: EMNLP 2025 , pp.\ 11040--11053, Suzhou, China, 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-emnlp.587. URL https://aclanthology.org/2025.findings-emnlp.587/

work page doi:10.18653/v1/2025.findings-emnlp.587 2025
[9]

Annotation artifacts in natural language inference data

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.\ 107--112, 2018

2018
[10]

Nizar Y. Habash. Introduction to Arabic Natural Language Processing, volume 10 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool, 2010

2010
[11]

MetaGPT : Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT : Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024

2024
[12]

Which is essential for chinese word segmentation: Character versus word

Chang-Ning Huang and Hai Zhao. Which is essential for chinese word segmentation: Character versus word. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pp.\ 1--12, Huazhong Normal University, Wuhan, China, 2006. Tsinghua University Press. URL https://aclanthology.org/Y06-1001/

2006
[13]

Time to impeach LLM -as-a-judge: Programs are the future of evaluation

Tzu-Heng Huang, Harit Vishwakarma, and Frederic Sala. Time to impeach LLM -as-a-judge: Programs are the future of evaluation. arXiv preprint arXiv:2506.10403, 2025

work page arXiv 2025
[14]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-Bench : Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Data contamination: From memorization to exploitation

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 157--165, 2022

2022
[16]

GitTaskBench : A benchmark for code agents solving real-world tasks through code repository leveraging

Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, et al. GitTaskBench : A benchmark for code agents solving real-world tasks through code repository leveraging. arXiv preprint arXiv:2508.18993, 2025

work page arXiv 2025
[17]

A syntactically expressive morphological analyzer for turkish

Adnan Ozturel, Tolga Kayadelen, and Isin Demirsahin. A syntactically expressive morphological analyzer for turkish. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing, pp.\ 65--75, Dresden, Germany, 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-3110. URL https://aclanthology.o...

work page doi:10.18653/v1/w19-3110 2019
[18]

GPT-Pilot : Your AI copilot for software development

Pythagora.io . GPT-Pilot : Your AI copilot for software development. GitHub repository, 2023. URL https://github.com/Pythagora-io/gpt-pilot

2023
[19]

mHumanEval : A multilingual benchmark to evaluate large language models for code generation

Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri. mHumanEval : A multilingual benchmark to evaluate large language models for code generation. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics, 2025

2025
[20]

Cross-lingual LLM -judge transfer via evaluation decomposition

Ivaxi Sheth, Zeno Jonke, Amin Mantrach, and Saab Mansour. Cross-lingual LLM -judge transfer via evaluation decomposition. arXiv preprint arXiv:2603.18557, 2026. doi:10.48550/arXiv.2603.18557. URL https://arxiv.org/abs/2603.18557

work page doi:10.48550/arxiv.2603.18557 2026
[21]

IndicGenBench : A multilingual benchmark to evaluate generation capabilities of LLM s on indic languages

Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar. IndicGenBench : A multilingual benchmark to evaluate generation capabilities of LLM s on indic languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 11047--11073, Bangkok, Thailand, 2024. Associat...

work page doi:10.18653/v1/2024.acl-long.595 2024
[22]

MM-Eval : A multilingual meta-evaluation benchmark for LLM -as-a-judge and reward models

Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristi \`a , Luc \'i a Tormo-Ba \ n uelos, and Seungone Kim. MM-Eval : A multilingual meta-evaluation benchmark for LLM -as-a-judge and reward models. arXiv preprint arXiv:2410.17578, 2024. doi:10.48550/arXiv.2410.17578. URL https://arxiv.o...

work page doi:10.48550/arxiv.2410.17578 2024
[23]

Ahilan Ayyachamy Nadar Ponnusamy

Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, et al. OmniCode : A benchmark for evaluating software engineering agents. arXiv preprint arXiv:2602.02262, 2026

work page arXiv 2026
[24]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenDevin : An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

work page internal anchor Pith review arXiv 2024
[25]

Swe-compass: Towards unified evaluation of agentic coding abilities for large language models, 2025

Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, et al. SWE-Compass : Towards unified evaluation of agentic coding abilities for large language models. arXiv preprint arXiv:2511.05459, 2025. doi:10.48550/arXiv.2511.05459. URL https://arxiv.org/abs/2511.05459

work page doi:10.48550/arxiv.2511.05459 2025
[26]

FLASK : Fine-grained language model evaluation based on alignment skill sets

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK : Fine-grained language model evaluation based on alignment skill sets. In The Twelfth International Conference on Learning Representations, 2024

2024
[27]

Xing, et al

Lianmin Zheng, Wei-Lin Chiang, Siyuan Sheng, Sihan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems, 2024

2024
[28]

Agent-as-a-judge: Evaluate agents with agents

Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024

work page arXiv 2024
[29]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
[30]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
[31]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
[32]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...