pith. machine review for the scientific record.

arxiv: 2605.12028 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR

Recognition: no theorem link

Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:43 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords query rewriting · hybrid retrieval · cross-encoder reranking · multi-turn retrieval · LoRA fine-tuning · temperature tuning · nDCG evaluation · shared-task system

The pith

A three-stage pipeline of query rewriting, hybrid search, and cross-encoder reranking achieves 0.531 nDCG@5, placing eighth of 38 teams in multi-turn retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a competitive system for the SemEval-2026 multi-turn retrieval task across four English domains. The approach consists of rewriting context-dependent follow-up questions into standalone queries with a fine-tuned language model, followed by hybrid keyword and vector retrieval, and finally reranking the candidates. Development experiments demonstrate that tuning the generation temperature separately for technical and general domains improves results, whereas more elaborate prompting and expansion techniques lower performance. The submitted run reaches an nDCG@5 of 0.531 on the test set, placing eighth among 38 teams and 10.7 percent above the baseline. Readers interested in conversational search would find value in the explicit pipeline and the empirical preference for targeted tuning over added complexity.

Core claim

The central claim is that a three-stage pipeline consisting of LoRA-fine-tuned query rewriting, reciprocal-rank-fusion hybrid retrieval, and cross-encoder reranking produces strong results on the official test set. The system attains an nDCG@5 score of 0.531, ranking eighth out of thirty-eight participating systems and exceeding the organizer baseline by 10.7 percent. Experiments on the development set indicate that domain-specific temperature settings for the query generator improve accuracy, with deterministic decoding favored in technical domains and modest randomness in general domains, while domain-aware prompting and multi-query expansion reduce effectiveness.

What carries the argument

The three-stage retrieval pipeline that rewrites follow-up questions into independent queries, performs hybrid BM25 and dense retrieval fused by reciprocal rank fusion, and applies cross-encoder reranking to improve final rankings.
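To make the fusion step concrete, here is a sketch of the standard reciprocal rank fusion formula (Cormack et al.), not the authors' code: each document is scored by summing 1/(k + rank) over the input rankings, conventionally with k = 60.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids. Each list in `rankings`
    is ordered best-first; returns ids sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a keyword (BM25-style) ranking and a dense ranking
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# d1 is ranked highly by both lists, so it wins the fusion
```

Because RRF uses only ranks, not raw scores, it sidesteps the score-calibration problem of mixing BM25 and cosine similarities directly.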

If this is right

  • Query rewriting is required to turn context-dependent follow-up questions into effective standalone queries for retrieval.
  • Combining BM25 and dense retrieval through reciprocal rank fusion yields better initial rankings than either method alone.
  • Cross-encoder reranking after hybrid retrieval further improves final ranking quality.
  • Domain-specific decoding temperature in query generation raises overall retrieval metrics more reliably than uniform settings.
  • Adding domain-aware prompting or multi-query expansion to the pipeline degrades rather than improves performance.
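The premises above can be read as a pipeline skeleton. In the hedged sketch below, trivial stand-ins replace the actual models (the LoRA-fine-tuned rewriter, BM25, the dense encoder, and BGE-reranker-v2-m3); every function is illustrative, not taken from the paper.

```python
def rewrite_query(history, question):
    # Stand-in for the fine-tuned rewriter: prepend the last turn of
    # context so the follow-up question becomes standalone.
    return " ".join(history[-1:] + [question])

def hybrid_retrieve(query, docs, k=60):
    # Toy BM25/dense stand-ins: term overlap and length similarity,
    # fused by reciprocal rank fusion over the two rankings.
    terms = set(query.lower().split())
    keyword = sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))
    dense = sorted(docs, key=lambda d: abs(len(d) - len(query)))
    scores = {}
    for ranking in (keyword, dense):
        for rank, d in enumerate(ranking, start=1):
            scores[d] = scores.get(d, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query, candidates):
    # Stand-in for the cross-encoder: score query-document pairs jointly.
    terms = set(query.lower().split())
    return sorted(candidates, key=lambda d: -len(terms & set(d.lower().split())))

docs = ["LoRA adapts large models cheaply", "BM25 ranks by term statistics"]
query = rewrite_query(["Tell me about LoRA"], "How does it adapt models?")
top = rerank(query, hybrid_retrieve(query, docs)[:2])
```

The staging matters: rewriting happens once per turn, hybrid retrieval narrows the corpus cheaply, and the expensive pairwise reranker sees only the shortlist.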

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged approach could be adapted to other conversational search applications by swapping in domain-appropriate base models for rewriting.
  • The observation that simpler tuned components outperform added complexity points to a general preference for focused engineering in retrieval pipelines.
  • Developers might test whether the temperature-tuning rule transfers to new languages or non-English domains without retraining the entire system.
  • Future work could replace the separate rewriting stage with an integrated model that jointly handles context and retrieval.

Load-bearing premise

The reported gains from domain-specific temperature tuning in query rewriting generalize beyond the development set and are not due to overfitting or unaccounted implementation details.

What would settle it

Evaluating the pipeline on the test set with uniform temperature across domains instead of domain-specific values; failure to maintain the 10.7 percent margin would show that the tuning benefit does not generalize.
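For readers checking the arithmetic, nDCG@5 (the task metric) normalizes discounted cumulative gain by its ideal value. A minimal sketch, assuming the exponential-gain formulation; the organizers' pytrec_eval setup may use a different gain function:

```python
import math

def dcg_at_k(relevances, k=5):
    # Exponential gain (2^rel - 1), logarithmic position discount.
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top retrieved documents
score = ndcg_at_k([3, 2, 0, 1, 0], k=5)
```

A perfect ranking scores 1.0, so the reported 0.531 leaves visible headroom on this benchmark.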

Figures

Figures reproduced from arXiv: 2605.12028 by David-Maximilian Caraman, Gheorghe Cosmin Silaghi.

Figure 1. LoRA fine-tuning loss curve for the query … (view at source ↗)
read the original abstract

We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper describes a three-stage pipeline for SemEval-2026 Task 8 (MTRAGEval) Task A across four English domains: (1) query rewriting via LoRA-fine-tuned Qwen 2.5 7B to convert follow-up questions to standalone queries, (2) hybrid BM25 + dense retrieval fused by Reciprocal Rank Fusion, and (3) reranking with BGE-reranker-v2-m3. It reports an official-test nDCG@5 of 0.531 (8th of 38 systems, +10.7% over baseline) and notes that domain-specific temperature tuning during query generation improves development-set results while more complex strategies do not.

Significance. The externally validated test-set ranking constitutes a clear, reproducible contribution to conversational retrieval benchmarks. The pipeline is a standard yet effective combination of components, and the temperature-tuning heuristic supplies a low-cost, domain-aware insight that could be tested in other multi-turn settings. Credit is due for grounding the headline result in the organizers' held-out evaluation rather than internal overfitting.

minor comments (4)
  1. Abstract: the manuscript states that domain-specific temperature tuning 'provides consistent gains' but supplies neither the magnitude of those gains, the exact temperature values per domain, nor any ablation or statistical test, making it impossible to assess whether the observation is robust or merely anecdotal.
  2. Abstract: no information is given on the training data or hyperparameters used for the LoRA fine-tuning of Qwen 2.5 7B, nor on the precise BM25/dense-retrieval weights or fusion parameters, which limits reproducibility of the reported pipeline.
  3. Abstract: the four English-language domains are not named, so the claim that technical domains benefit from deterministic decoding while general domains benefit from controlled randomness cannot be evaluated or replicated by readers.
  4. The manuscript would be strengthened by a table or figure showing component-wise ablations on the development set and by explicit comparison against the organizer baseline in the same table.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our contribution, particularly the recognition of the externally validated test-set ranking and the practical value of the domain-specific temperature tuning heuristic. The recommendation for minor revision is appreciated, and we will incorporate any editorial adjustments in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a shared-task system description whose central claim is an externally evaluated nDCG@5 score on the organizers' held-out test set. The three-stage pipeline (LoRA query rewriting, hybrid BM25+dense retrieval with RRF, cross-encoder reranking) is assembled from standard components; domain-specific temperature observations are reported only as dev-set findings and are not required to produce or justify the official test metric. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations reduce any result to its own inputs by construction. The evaluation pipeline is independent of the authors' internal choices.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of standard retrieval components plus empirical tuning; no new entities or unstated axioms are introduced beyond the assumption that the described pipeline generalizes.

free parameters (1)
  • domain-specific temperature for query generation
    Tuned separately for technical versus general domains on development data to control determinism versus randomness.
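As an illustration of what this free parameter controls (not the authors' decoding code): temperature divides the logits before the softmax, so low values concentrate probability on the top token, approaching the deterministic decoding preferred for technical domains, while values near 1 retain the controlled randomness preferred for general domains.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # controlled randomness
```

At temperature 0.1 nearly all mass sits on the top token, mimicking greedy decoding; at 1.0 the alternatives keep non-trivial probability, which is what the tuned per-domain values trade off.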

pith-pipeline@v0.9.0 · 5491 in / 1111 out tokens · 55086 ms · 2026-05-13T04:43:49.140255+00:00 · methodology

