pith. machine review for the scientific record.

arxiv: 2605.04523 · v1 · submitted 2026-05-06 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM ensemble · faithful response generation · multi-turn dialogue · judge model · SemEval task · Meno-Lite model · reference passages
0 comments

The pith

A judge-orchestrated ensemble of seven LLMs produced the top-scoring faithful responses in a multi-turn generation task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system that generates candidate responses from seven different large language models using two distinct prompting approaches, then relies on a GPT-4o-mini judge to select the single best output for each instance. This method placed first among twenty-six teams on the reference-passage version of the SemEval task, reaching a conditioned harmonic mean of 0.7827 against 0.6390 for the strongest baseline (gpt-oss-120b). The authors demonstrate through ablations that combining models from varied families, scales, and prompt styles consistently outperforms any lone model, and they release an additional 7B domain-adapted model called Meno-Lite-0.1 that balances cost and performance.
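In sketch form, the selection loop looks like the following. The generator names, the judge prompt, and the call_llm helper are all illustrative assumptions, not the authors' code; their exact prompts and model list live in the linked repository.

    from itertools import product

    # Seven heterogeneous generators and two prompting variants, per the paper.
    # The names below are placeholders, not the authors' actual model list.
    GENERATORS = [f"generator-{i}" for i in range(1, 8)]
    PROMPT_STYLES = ["refined_system", "category_few_shot"]

    def call_llm(model: str, prompt: str) -> str:
        """Stub for an LLM API call; wire up to a real inference client."""
        raise NotImplementedError

    def generate(model: str, style: str, dialogue: str, passages: str) -> str:
        """Produce one candidate response from one (model, prompt-style) pair."""
        return call_llm(model, f"[{style} prompt]\n{passages}\n{dialogue}")

    def judge_best(candidates: list[str], dialogue: str, passages: str) -> str:
        """Ask a small judge model to pick the most faithful candidate."""
        numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        prompt = (
            "Given the reference passages and dialogue, reply with the index "
            "of the single candidate most faithful to the passages.\n"
            f"Passages:\n{passages}\n\nDialogue:\n{dialogue}\n\n"
            f"Candidates:\n{numbered}"
        )
        return candidates[int(call_llm("gpt-4o-mini", prompt).strip())]

    def respond(dialogue: str, passages: str) -> str:
        """Fourteen candidates (7 models x 2 styles), one judge-picked winner."""
        candidates = [generate(m, s, dialogue, passages)
                      for m, s in product(GENERATORS, PROMPT_STYLES)]
        return judge_best(candidates, dialogue, passages)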

Core claim

The central claim is that a heterogeneous ensemble of LLMs, with a smaller judge model selecting the most faithful candidate per turn, generates higher-quality multi-turn responses than any individual model when reference passages are provided.

What carries the argument

The GPT-4o-mini judge that ranks and selects among candidates produced by the seven-LLM ensemble with two prompting variants.

If this is right

  • Mixing model families and prompting strategies yields gains that no single model achieves alone.
  • A compact judge model can effectively curate outputs from much larger generators.
  • Domain-adapted 7B models can serve as strong, lower-cost components within such ensembles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same judge-selection pattern could transfer to other faithfulness-critical generation settings outside this specific benchmark.
  • Reducing reliance on proprietary models by swapping in more open-source generators would test the approach's robustness.
  • Annotation limitations noted in the task analysis suggest that improved evaluation protocols might further clarify where ensembles add the most value.

Load-bearing premise

The smaller judge model can reliably identify which candidate response is most faithful to the reference passages.

What would settle it

Human raters or a different judge model systematically preferring outputs that the GPT-4o-mini judge ranked lower on the same test set would falsify the selection mechanism.
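A minimal way to run that test, assuming per-instance human preference labels and the judge's candidate rankings are available (both data layouts below are hypothetical):

    def judge_overrule_rate(judge_rankings: list[list[int]],
                            human_best: list[int]) -> float:
        """Fraction of instances where humans prefer a candidate the judge
        did not rank first. judge_rankings[i] lists candidate indices for
        instance i, best first; human_best[i] is the human-preferred index."""
        overruled = sum(ranking[0] != best
                        for ranking, best in zip(judge_rankings, human_best))
        return overruled / len(judge_rankings)

    # Toy check: humans overrule the judge's top pick on 2 of 4 instances.
    assert judge_overrule_rate([[0, 1], [2, 0], [1, 0], [0, 2]],
                               [0, 0, 1, 2]) == 0.5

A rate near zero on the official test set would support the premise; a consistently high rate, or a second judge agreeing with humans against GPT-4o-mini, would undermine it.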

Figures

Figures reproduced from arXiv: 2605.04523 by Ivan Bondarenko, Ivan Chernov, Mikhail Komarov, Mikhail Kulakov, Oleg Sedukhin, Roman Derunets.

Figure 1
Figure 1. Overview of the proposed framework. Stage (a): two prompting strategies – an iteratively refined system prompt and a category-aware few-shot variant. Stage (b): seven LLMs from two groups generate candidate responses. Stage (c): GPT-4o-mini evaluates all candidates and selects the best response. view at source ↗
Figure 2
Figure 2. Contribution of each model to the weighted … view at source ↗
read the original abstract

We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost–performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
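Neither the abstract nor the review defines MTRAGEval's conditioned harmonic mean. As a rough, assumption-laden illustration only, one can read it as a harmonic mean of per-dimension quality scores averaged over just the instances that satisfy a gating condition:

    from statistics import harmonic_mean

    def conditioned_harmonic_mean(instances, condition) -> float:
        """Illustrative guess, not the official metric: harmonic-mean the
        per-dimension scores of each instance, averaging only over the
        instances that satisfy the condition (e.g., answerable turns)."""
        kept = [inst for inst in instances if condition(inst)]
        return sum(harmonic_mean(list(inst["dims"].values()))
                   for inst in kept) / len(kept)

    # Toy usage with two quality dimensions and an answerability gate.
    print(conditioned_harmonic_mean(
        [{"answerable": True,  "dims": {"faithfulness": 0.9, "relevance": 0.7}},
         {"answerable": False, "dims": {"faithfulness": 0.2, "relevance": 0.1}}],
        condition=lambda inst: inst["answerable"]))  # -> ~0.7875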

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript reports on RaguTeam's winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. It describes a heterogeneous ensemble of seven LLMs using two prompting variants, with a GPT-4o-mini judge selecting the best candidate per instance. The system ranked 1st out of 26 teams with a conditioned harmonic mean of 0.7827, outperforming the strongest baseline (gpt-oss-120b at 0.6390). The paper presents ablations showing the necessity of diversity in model families, scales, and prompting strategies, introduces the 7B domain-adapted Meno-Lite-0.1 model, analyzes annotation limitations in MTRAGEval, and releases code at https://github.com/RaguTeam/ragu_mtrag_semeval.

Significance. If the results hold, the work is significant for providing externally verified evidence that judge-orchestrated LLM ensembles with model and prompt diversity can substantially improve faithfulness in multi-turn response generation. The competition ranking, public code release, and introduction of a cost-effective domain-adapted model (Meno-Lite-0.1) are explicit strengths that support reproducibility and practical utility. The benchmark analysis offers concrete directions for improving evaluation protocols in the field.

minor comments (2)
  1. [Method] The method section does not include the exact prompt templates or selection criteria used by the GPT-4o-mini judge. Adding these (or a pointer to their location in the public repository) would clarify how faithfulness is operationalized during candidate selection.
  2. [Ablations] The ablations subsection reports that the ensemble outperforms single models but does not tabulate the per-model or per-prompt scores. Including these raw numbers would make the contribution of diversity more transparent and easier to compare against future work.
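A minimal sketch of the tabulation requested in comment 2, built from hypothetical per-instance score records (all field names are assumed, not the authors' schema, and the numbers are dummies):

    from collections import defaultdict
    from statistics import mean

    def ablation_table(records: list[dict]) -> None:
        """Print the mean score per (model, prompt style) cell."""
        cells = defaultdict(list)
        for r in records:
            cells[(r["model"], r["style"])].append(r["score"])
        print(f"{'model':<16}{'style':<22}{'mean score':>10}")
        for (model, style), scores in sorted(cells.items()):
            print(f"{model:<16}{style:<22}{mean(scores):>10.4f}")

    # Toy records; real numbers would come from the released evaluation runs.
    ablation_table([
        {"model": "model-a", "style": "refined_system", "score": 0.71},
        {"model": "model-a", "style": "category_few_shot", "score": 0.69},
        {"model": "model-b", "style": "refined_system", "score": 0.74},
    ])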

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the work's significance, and recommendation for minor revision. The report does not include any specific major comments to address point by point.

Circularity Check

0 steps flagged

No significant circularity; result is externally benchmarked

full rationale

The paper's central claim is a 1st-place ranking (0.7827 conditioned harmonic mean) on the official SemEval-2026 Task 8 leaderboard, determined by organizers' evaluation protocol against an external baseline (gpt-oss-120b at 0.6390). The system description (heterogeneous LLM ensemble + GPT-4o-mini judge for candidate selection) contains no equations, fitted parameters renamed as predictions, or load-bearing self-citations. Ablations and model introduction (Meno-Lite-0.1) are empirical observations, not derivations that reduce to inputs by construction. The outcome is independently verifiable via the public leaderboard and released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on existing LLMs, standard prompting, and an off-the-shelf judge model without new postulates.

pith-pipeline@v0.9.0 · 5503 in / 1086 out tokens · 37395 ms · 2026-05-08T16:58:37.741195+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 4 canonical work pages · 2 internal anchors
