A Reproducibility Study of LLM-Based Query Reformulation
Pith reviewed 2026-05-07 09:00 UTC · model grok-4.3
The pith
Reformulation gains from LLMs in search are strongly conditioned on the retrieval paradigm used.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a single experimental framework covering two LLM families at two scales, three retrieval paradigms, and nine datasets, LLM query reformulation produces effectiveness gains that depend on the retrieval paradigm, with lexical improvements not transferring consistently to neural retrievers and larger models not uniformly improving results.
What carries the argument
The unified experimental framework that re-implements ten representative LLM reformulation methods with identical prompts, controls, and evaluation across lexical, learned sparse, and dense retrieval on TREC Deep Learning and BEIR collections.
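The load-bearing design choice here is a single harness that routes every method's query through its own prompt template while holding everything else fixed. The paper's actual QueryGym interfaces are not shown on this page, so the sketch below is purely illustrative: the method names, templates, and the `llm` callable are hypothetical stand-ins, not the released toolkit's API.

```python
# Hypothetical sketch of a shared reformulation harness: per-method prompt
# templates, identical surrounding controls. Not the QueryGym API.
TEMPLATES = {
    "q2d": "Write a passage that answers the query: {query}",
    "rewrite": "Rewrite the search query to be more effective: {query}",
}

def reformulate(method: str, query: str, llm) -> str:
    """Apply one method's template and return the model's reformulation."""
    prompt = TEMPLATES[method].format(query=query)
    return llm(prompt).strip()

# `llm` is any callable str -> str; an echo stub stands in for a real model here.
echo_llm = lambda p: p.upper()
```

Keeping the template as the only per-method degree of freedom is what lets the study attribute effectiveness differences to the method rather than to incidental implementation choices.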
Load-bearing premise
The ten chosen methods adequately represent the space of LLM query reformulation, and the single controlled setup removes all original implementation differences.
What would settle it
A follow-up experiment that applies the same ten methods under the same controls but observes consistent positive transfer from lexical to dense retrieval across multiple datasets would falsify the conditioning claim.
Original abstract
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (https://leaderboard.querygym.com).
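The headline metric behind these comparisons, nDCG@10, rewards placing highly relevant documents near the top of the ranking. As a reference point, here is a minimal self-contained sketch; note it normalizes against the ideal ordering of the retrieved list only, whereas standard trec_eval-style nDCG normalizes against all judged relevant documents for the query, so this is a simplification, and the relevance grades below are hypothetical.

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k=10):
    """DCG normalized by the DCG of the ideal reordering of the same labels."""
    idcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevance of the top 10 documents for one query.
run = [3, 2, 0, 1, 0, 0, 2, 0, 0, 1]
score = ndcg_at_k(run)  # strictly below 1.0, since the ordering is imperfect
```

Because the discount is logarithmic in rank, a reformulation that swaps a relevant document from rank 7 into rank 3 moves this number far more than one that reshuffles the tail, which is why paradigm-dependent changes at the top of the ranking dominate the reported gains.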
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a reproducibility study of ten LLM-based query reformulation methods in information retrieval. The authors re-implement these methods under strictly controlled conditions using the same prompts, post-processing steps, two LLM families at two scales, three retrieval paradigms (lexical, learned-sparse, and dense), and nine standard benchmark datasets from TREC Deep Learning and BEIR. Key findings include that reformulation gains are strongly dependent on the retrieval paradigm, lexical gains do not consistently transfer to neural retrievers, and larger LLMs do not always lead to better performance. All artifacts are released via the QueryGym toolkit and a public leaderboard.
Significance. This study is significant because it provides a controlled comparison that explains why prior results on LLM query reformulation have been inconsistent across papers. By isolating the effect of retrieval paradigm and LLM scale, it demonstrates the limits of generalizing gains from lexical to neural settings. The open release of prompts, code, run files, and the leaderboard is a notable strength that supports reproducibility and future work in the field. If the results hold under the released artifacts, this will serve as a valuable reference for the community.
minor comments (3)
- The introduction could more explicitly list the research questions that the unified framework is designed to answer, to better frame the subsequent results sections.
- Tables reporting nDCG@10 and Recall@100 would benefit from including standard deviations or confidence intervals across runs to allow readers to assess the stability of the reported differences between paradigms.
- The QueryGym toolkit is referenced with a URL; a short paragraph in §3 describing its core components (prompt templates, post-processing, evaluation harness) would improve self-containment without relying on the external site.
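The second comment, on reporting variability, can be made concrete with a paired bootstrap over per-query scores. The sketch below is a generic recipe under the assumption that per-query nDCG@10 values are available for a baseline run and a reformulated run; the numbers are hypothetical and this is not taken from the paper's evaluation scripts.

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-query difference (b - a) with a percentile bootstrap CI.

    Resamples queries with replacement, keeping the pairing intact so that
    per-query difficulty cancels out of the comparison.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        statistics.fmean(diffs[rng.randrange(n)] for _ in range(n))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), (lo, hi)

# Hypothetical per-query nDCG@10 for a baseline and a reformulated run.
baseline = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.45]
reformed = [0.47, 0.53, 0.40, 0.66, 0.50, 0.58, 0.41, 0.49]
mean_delta, (ci_lo, ci_hi) = paired_bootstrap_ci(baseline, reformed)
```

An interval that excludes zero would support the claimed paradigm-level difference; one that straddles zero would flag the gain as within run-to-run noise.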
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our reproducibility study, including its significance in explaining inconsistent prior results on LLM query reformulation. The recommendation for minor revision is noted. As the report lists no specific major comments, we have no point-by-point rebuttals to provide and will incorporate any minor editorial suggestions in the revised manuscript.
Circularity Check
No significant circularity
full rationale
This is a purely empirical reproducibility study that re-implements ten existing LLM-based query reformulation methods from prior literature and evaluates them under unified experimental controls across retrieval paradigms, LLM scales, and datasets. The central claims (conditioning of gains on retrieval paradigm, non-transfer from lexical to neural, non-uniform scaling benefits) are direct observations from the new controlled runs rather than quantities defined in terms of fitted parameters or reduced to the paper's own prior results by equation or self-citation chain. No derivation steps exist; the work is self-contained against external benchmarks via released artifacts and does not invoke uniqueness theorems or ansatzes from the authors' own prior papers as load-bearing premises.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard TREC and BEIR metrics (nDCG, Recall) are appropriate and comparable across lexical, learned-sparse, and dense retrieval paradigms.
- domain assumption The ten chosen reformulation methods are representative of the LLM-based query reformulation literature.
Reference graph
Works this paper leans on
- [1] Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD.
- [2] Negar Arabzadeh, Amin Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, and Ebrahim Bagheri. 2021. Matches made in heaven: Toolkit and large-scale datasets for supervised query reformulation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4417–4425.
- [3] Jagdev Bhogal, Andrew MacFarlane, and Peter Smith. 2007. A review of ontology based query expansion. Information Processing & Management 43, 4 (2007), 866–886.
- [4] Amin Bigdeli, Negar Arabzadeh, and Ebrahim Bagheri. 2024. Learning to jointly transform and rank difficult queries. In European Conference on Information Retrieval. Springer, 40–48.
- [5] Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, and Ebrahim Bagheri. 2026. ReFormeR: Learning and Applying Explicit Query Reformulation Patterns. In European Conference on Information Retrieval. Springer, 400–408.
- [6] Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, and Ebrahim Bagheri. 2026. QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation. In ACM Web Conference 2026. https://doi.org/10.48550/ARXIV.2511.15996
- [7] Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. 2011. Query reformulation mining: models, patterns, and applications. Information Retrieval 14, 3 (2011), 257–289.
- [8]
- [9]
- [10] Van Dang and Bruce W. Croft. 2010. Query reformulation using anchor text. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. 41–50.
- [11] Alin Deutsch, Lucian Popa, and Val Tannen. 2006. Query reformulation with constraints. ACM SIGMOD Record 35, 1 (2006), 65–73.
- [12] Kaustubh D. Dhole and Eugene Agichtein. 2024. GenQREnsemble: Zero-shot LLM ensemble prompting for generative query reformulation. In European Conference on Information Retrieval. Springer, 326–335.
- [13] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant.
- [14] From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2353–2359.
- [15] Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2021. Complement lexical retrieval model with semantic residual embeddings. In European Conference on Information Retrieval. Springer, 146–160.
- [16] Seyed Mohammad Hosseini, Negar Arabzadeh, Morteza Zihayat, and Ebrahim Bagheri. 2024. Enhanced retrieval effectiveness through selective query generation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 3792–3796.
- [17]
- [18] Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, and Jimmy Lin. 2024. Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, ...
- [19] Yibin Lei, Yu Cao, Tianyi Zhou, Tao Shen, and Andrew Yates. 2024. Corpus-Steered Query Expansion with Large Language Models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, St. Julian’...
- [20] Jimmy Lin. 2019. The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, Vol. 52. ACM, New York, NY, USA, 40–51.
- [21] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). Association f...
- [22] Meili Lu, Xiaobing Sun, Shaowei Wang, David Lo, and Yucong Duan. 2015. Query expansion via WordNet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 545–549.
- [23] Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2022. Document expansion baselines and learned sparse lexical representations for MS MARCO v1 and v2. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3187–3197.
- [24] Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How deep is your learning: The DL-HARD annotated deep learning dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2335–2341.
- [25]
- [26] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human-generated machine reading comprehension dataset.
- [27] Jessie Ooi, Xiuqin Ma, Hongwu Qin, and Siau Chuin Liew. 2015. A survey of query expansion, query suggestion and query refinement techniques. In 2015 4th International Conference on Software Engineering and Computer Systems (ICSECS). IEEE, 112–117.
- [28] OpenAI. 2025. Introducing GPT-4.1 in the API. Online at https://openai.com/index/gpt-4-1/. Accessed 2025-12-28.
- [29] Dipasree Pal, Mandar Mitra, and Kalyankumar Datta. 2014. Improving query expansion using WordNet. Journal of the Association for Information Science and Technology 65, 12 (2014), 2469–2478.
- [30] Yonggang Qiu and Hans-Peter Frei. 1993. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 160–169.
- [31] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, et al. 2025. Qwen2.5 Technical Report. https://doi.org/10.48550/arXiv.2412.15115
- [32] Joseph John Rocchio Jr. 1971. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing (1971).
- [33]
- [34] Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Yibin Lei, Tianyi Zhou, Michael Blumenstein, and Daxin Jiang. 2024. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computatio...
- [35]
- [36]
- [37] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 641–649.
- [38] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically examining the "neural hype": weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1129–1132.
- [39] Le Zhang, Yihong Wu, Qian Yang, and Jian-Yun Nie. 2024. Exploring the Best Practices of Query Expansion with Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, Miami, Florida, USA, 1872–1883. https://doi.org/10....
- [40] Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics 11 (2023), 1114–1131.
discussion (0)