A Reproducibility Study of LLM-Based Query Reformulation
Pith reviewed 2026-05-07 09:00 UTC · model grok-4.3
The pith
Reformulation gains from LLMs in search are strongly conditioned on the retrieval paradigm used.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a single experimental framework covering two LLM families at two scales, three retrieval paradigms, and nine datasets, LLM query reformulation produces effectiveness gains that depend on the retrieval paradigm, with lexical improvements not transferring consistently to neural retrievers and larger models not uniformly improving results.
What carries the argument
The unified experimental framework that re-implements ten representative LLM reformulation methods with identical prompts, controls, and evaluation across lexical, learned sparse, and dense retrieval on TREC Deep Learning and BEIR collections.
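The load-bearing design choice here is a single harness that routes every method's query through its own prompt template while holding everything else fixed. The paper's actual QueryGym interfaces are not shown on this page, so the sketch below is purely illustrative: the method names, templates, and the `llm` callable are hypothetical stand-ins, not the released toolkit's API.

```python
# Hypothetical sketch of a shared reformulation harness: per-method prompt
# templates, identical surrounding controls. Not the QueryGym API.
TEMPLATES = {
    "q2d": "Write a passage that answers the query: {query}",
    "rewrite": "Rewrite the search query to be more effective: {query}",
}

def reformulate(method: str, query: str, llm) -> str:
    """Apply one method's template and return the model's reformulation."""
    prompt = TEMPLATES[method].format(query=query)
    return llm(prompt).strip()

# `llm` is any callable str -> str; an echo stub stands in for a real model here.
echo_llm = lambda p: p.upper()
```

Keeping the template as the only per-method degree of freedom is what lets the study attribute effectiveness differences to the method rather than to incidental implementation choices.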
Load-bearing premise
The ten chosen methods adequately represent the space of LLM query reformulation, and the single controlled setup removes all original implementation differences.
What would settle it
A follow-up experiment that applies the same ten methods under the same controls but observes consistent positive transfer from lexical to dense retrieval across multiple datasets would falsify the conditioning claim.
Original abstract
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (https://leaderboard.querygym.com).
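The headline metric behind these comparisons, nDCG@10, rewards placing highly relevant documents near the top of the ranking. As a reference point, here is a minimal self-contained sketch; note it normalizes against the ideal ordering of the retrieved list only, whereas standard trec_eval-style nDCG normalizes against all judged relevant documents for the query, so this is a simplification, and the relevance grades below are hypothetical.

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k=10):
    """DCG normalized by the DCG of the ideal reordering of the same labels."""
    idcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevance of the top 10 documents for one query.
run = [3, 2, 0, 1, 0, 0, 2, 0, 0, 1]
score = ndcg_at_k(run)  # strictly below 1.0, since the ordering is imperfect
```

Because the discount is logarithmic in rank, a reformulation that swaps a relevant document from rank 7 into rank 3 moves this number far more than one that reshuffles the tail, which is why paradigm-dependent changes at the top of the ranking dominate the reported gains.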
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a reproducibility study of ten LLM-based query reformulation methods in information retrieval. The authors re-implement these methods under strictly controlled conditions using the same prompts, post-processing steps, two LLM families at two scales, three retrieval paradigms (lexical, learned-sparse, and dense), and nine standard benchmark datasets from TREC Deep Learning and BEIR. Key findings include that reformulation gains are strongly dependent on the retrieval paradigm, lexical gains do not consistently transfer to neural retrievers, and larger LLMs do not always lead to better performance. All artifacts are released via the QueryGym toolkit and a public leaderboard.
Significance. This study is significant because it provides a controlled comparison that explains why prior results on LLM query reformulation have been inconsistent across papers. By isolating the effect of retrieval paradigm and LLM scale, it demonstrates the limits of generalizing gains from lexical to neural settings. The open release of prompts, code, run files, and the leaderboard is a notable strength that supports reproducibility and future work in the field. If the results hold under the released artifacts, this will serve as a valuable reference for the community.
minor comments (3)
- The introduction could more explicitly list the research questions that the unified framework is designed to answer, to better frame the subsequent results sections.
- Tables reporting nDCG@10 and Recall@100 would benefit from including standard deviations or confidence intervals across runs to allow readers to assess the stability of the reported differences between paradigms.
- The QueryGym toolkit is referenced with a URL; a short paragraph in §3 describing its core components (prompt templates, post-processing, evaluation harness) would improve self-containment without relying on the external site.
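The second comment, on reporting variability, can be made concrete with a paired bootstrap over per-query scores. The sketch below is a generic recipe under the assumption that per-query nDCG@10 values are available for a baseline run and a reformulated run; the numbers are hypothetical and this is not taken from the paper's evaluation scripts.

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-query difference (b - a) with a percentile bootstrap CI.

    Resamples queries with replacement, keeping the pairing intact so that
    per-query difficulty cancels out of the comparison.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        statistics.fmean(diffs[rng.randrange(n)] for _ in range(n))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), (lo, hi)

# Hypothetical per-query nDCG@10 for a baseline and a reformulated run.
baseline = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.45]
reformed = [0.47, 0.53, 0.40, 0.66, 0.50, 0.58, 0.41, 0.49]
mean_delta, (ci_lo, ci_hi) = paired_bootstrap_ci(baseline, reformed)
```

An interval that excludes zero would support the claimed paradigm-level difference; one that straddles zero would flag the gain as within run-to-run noise.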
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our reproducibility study, including its significance in explaining inconsistent prior results on LLM query reformulation. The recommendation for minor revision is noted. As the report lists no specific major comments, we have no point-by-point rebuttals to provide and will incorporate any minor editorial suggestions in the revised manuscript.
Circularity Check
No significant circularity
full rationale
This is a purely empirical reproducibility study that re-implements ten existing LLM-based query reformulation methods from prior literature and evaluates them under unified experimental controls across retrieval paradigms, LLM scales, and datasets. The central claims (conditioning of gains on retrieval paradigm, non-transfer from lexical to neural, non-uniform scaling benefits) are direct observations from the new controlled runs rather than quantities defined in terms of fitted parameters or reduced to the paper's own prior results by equation or self-citation chain. No derivation steps exist; the work is self-contained against external benchmarks via released artifacts and does not invoke uniqueness theorems or ansatzes from the authors' own prior papers as load-bearing premises.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard TREC and BEIR metrics (nDCG, Recall) are appropriate and comparable across lexical, learned-sparse, and dense retrieval paradigms.
- domain assumption The ten chosen reformulation methods are representative of the LLM-based query reformulation literature.
Reference graph
Works this paper leans on
- [1] Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD.
- [2] Negar Arabzadeh, Amin Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, and Ebrahim Bagheri. 2021. Matches made in heaven: Toolkit and large-scale datasets for supervised query reformulation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4417–4425.
- [3] Jagdev Bhogal, Andrew MacFarlane, and Peter Smith. 2007. A review of ontology based query expansion. Information Processing & Management 43, 4 (2007), 866–886.
- [4] Amin Bigdeli, Negar Arabzadeh, and Ebrahim Bagheri. 2024. Learning to jointly transform and rank difficult queries. In European Conference on Information Retrieval. Springer, 40–48.
- [5] Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, and Ebrahim Bagheri. 2026. ReFormeR: Learning and Applying Explicit Query Reformulation Patterns. In European Conference on Information Retrieval. Springer, 400–408.
- [6] Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, and Ebrahim Bagheri. 2026. QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation. In ACM Web Conference 2026. https://doi.org/10.48550/ARXIV.2511.15996
- [7] Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. 2011. Query reformulation mining: models, patterns, and applications. Information Retrieval 14, 3 (2011), 257–289.
- [8]
- [9]
- [10] Van Dang and Bruce W. Croft. 2010. Query reformulation using anchor text. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. 41–50.
- [11] Alin Deutsch, Lucian Popa, and Val Tannen. 2006. Query reformulation with constraints. ACM SIGMOD Record 35, 1 (2006), 65–73.
- [12] Kaustubh D. Dhole and Eugene Agichtein. 2024. GenQREnsemble: Zero-shot LLM ensemble prompting for generative query reformulation. In European Conference on Information Retrieval. Springer, 326–335.
- [13] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant.
- [14] From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2353–2359.
- [15] Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2021. Complement lexical retrieval model with semantic residual embeddings. In European Conference on Information Retrieval. Springer, 146–160.
- [16] Seyed Mohammad Hosseini, Negar Arabzadeh, Morteza Zihayat, and Ebrahim Bagheri. 2024. Enhanced retrieval effectiveness through selective query generation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 3792–3796.
- [17]
- [18] Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, and Jimmy Lin. 2024. Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, ...
- [19] Yibin Lei, Yu Cao, Tianyi Zhou, Tao Shen, and Andrew Yates. 2024. Corpus-Steered Query Expansion with Large Language Models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, St. Julian’...
- [20] Jimmy Lin. 2019. The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, Vol. 52. ACM, New York, NY, USA, 40–51.
- [21] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). Association f...
- [22] Meili Lu, Xiaobing Sun, Shaowei Wang, David Lo, and Yucong Duan. 2015. Query expansion via WordNet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 545–549.
- [23] Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2022. Document expansion baselines and learned sparse lexical representations for MS MARCO v1 and v2. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3187–3197.
- [24] Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How deep is your learning: The DL-HARD annotated deep learning dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2335–2341.
- [25]
- [26] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human-generated machine reading comprehension dataset.
- [27] Jessie Ooi, Xiuqin Ma, Hongwu Qin, and Siau Chuin Liew. 2015. A survey of query expansion, query suggestion and query refinement techniques. In 2015 4th International Conference on Software Engineering and Computer Systems (ICSECS). IEEE, 112–117.
- [28] OpenAI. 2025. Introducing GPT-4.1 in the API. Online at https://openai.com/index/gpt-4-1/. Accessed 2025-12-28.
- [29] Dipasree Pal, Mandar Mitra, and Kalyankumar Datta. 2014. Improving query expansion using WordNet. Journal of the Association for Information Science and Technology 65, 12 (2014), 2469–2478.
- [30] Yonggang Qiu and Hans-Peter Frei. 1993. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 160–169.
- [31] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, et al. 2025. Qwen2.5 Technical Report. https://doi.org/10.48550/arXiv.2412.15115
- [32] Joseph John Rocchio Jr. 1971. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing (1971).
- [33]
- [34] Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Yibin Lei, Tianyi Zhou, Michael Blumenstein, and Daxin Jiang. 2024. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computatio...
- [35]
- [36]
- [37] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 641–649.
- [38] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically examining the "neural hype": weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1129–1132.
- [39] Le Zhang, Yihong Wu, Qian Yang, and Jian-Yun Nie. 2024. Exploring the Best Practices of Query Expansion with Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, Miami, Florida, USA, 1872–1883. https://doi.org/10....
- [40] Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics 11 (2023), 1114–1131.
discussion (0)