Recognition: 2 theorem links
Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
Pith reviewed 2026-05-14 22:41 UTC · model grok-4.3
The pith
VOTE-RAG reduces compounded RAG hallucinations by voting across multiple retrieval queries and independent answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VOTE-RAG is a two-stage ensemble voting framework: it first aggregates documents through parallel retrieval voting over diverse queries, then resolves the answer through response voting, taking the majority among independently generated outputs. On six benchmark datasets it achieves performance comparable to or surpassing more complex frameworks.
What carries the argument
Two-stage voting mechanism: retrieval voting pools documents from multiple parallel diverse queries, followed by response voting that selects the majority answer from independent generations based on the pooled documents.
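The two-stage mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `make_queries`, `retrieve`, and `generate` interfaces, the deduplication step, and the agent counts are assumptions for the sketch.

```python
from collections import Counter
from typing import Callable, List

def vote_rag(question: str,
             make_queries: Callable[[str, int], List[str]],  # diverse query generation (assumed interface)
             retrieve: Callable[[str], List[str]],           # retriever returning documents for one query
             generate: Callable[[str, List[str]], str],      # one independent answer from the pooled docs
             n_query_agents: int = 4,
             n_answer_agents: int = 5) -> str:
    # Stage 1: Retrieval Voting — pool documents from diverse parallel queries.
    pooled: List[str] = []
    for q in make_queries(question, n_query_agents):
        for doc in retrieve(q):
            if doc not in pooled:  # deduplicate while preserving order
                pooled.append(doc)
    # Stage 2: Response Voting — independent generations, majority answer wins.
    answers = [generate(question, pooled) for _ in range(n_answer_agents)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```

Because neither stage depends on a previous agent's output, both loops can run concurrently, which is what the parallelizability claim rests on.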
If this is right
- Performance matches or exceeds that of more complex RAG frameworks on six standard benchmarks.
- The architecture stays simpler and remains fully parallelizable, avoiding sequential refinement steps.
- Problem drift risk disappears because the original query is never altered during the process.
- No training or fine-tuning is required, allowing direct use with existing models.
Where Pith is reading between the lines
- The same voting pattern could be applied to reduce other LLM consistency failures outside retrieval settings.
- Increasing the number of parallel agents may improve accuracy further provided compute budgets allow.
- The method can wrap existing RAG pipelines with only minor changes to query and answer generation calls.
Load-bearing premise
Majority voting among independently generated responses will reliably select the correct answer when the retrieved documents contain misleading content that could prompt consistent hallucinations.
What would settle it
A controlled run on any of the six benchmarks in which a majority of agents produce the same incorrect answer while a minority produces the correct one, causing the final vote to select the error.
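That decisive failure case is easy to state concretely: majority voting selects whatever answer is most common, so if misleading pooled documents push most agents toward the same wrong answer, the vote locks the error in. A toy illustration (the labels are hypothetical, not benchmark data):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical run: misleading context makes 3 of 5 agents repeat the
# same hallucination, so the final vote selects the error.
agents = ["hallucinated", "hallucinated", "hallucinated", "correct", "correct"]
assert majority_vote(agents) == "hallucinated"
```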
Original abstract
Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VOTE-RAG, a training-free two-stage ensemble framework to mitigate 'hallucination on hallucination' in RAG. Stage 1 (Retrieval Voting) generates diverse queries in parallel and aggregates retrieved documents; Stage 2 (Response Voting) has multiple agents independently generate answers from the aggregated context and selects the majority-vote output. The central claim is that this achieves performance comparable to or better than more complex methods on six benchmarks while remaining simpler, fully parallelizable, and free of problem-drift risk.
Significance. If the empirical claims hold with proper controls and statistics, the work would demonstrate that lightweight, training-free majority voting can reliably outperform or match elaborate RAG variants, offering a practical, scalable baseline for hallucination mitigation that emphasizes simplicity and reproducibility.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: comparative results on six benchmarks are asserted without any reported metrics, baselines, error bars, statistical significance tests, or per-dataset tables, so the performance claim cannot be evaluated and is load-bearing for the entire contribution.
- [Section 3.2] Response Voting description (Section 3.2): the mechanism assumes the correct answer remains the mode even when retrieval contains misleading documents, yet no agreement rates, tie-resolution procedure, number of agents, or error-case analysis (e.g., instances where consistent hallucinations outvote the truth) are provided; this directly tests the core 'hallucination on hallucination' mitigation hypothesis.
- [Section 3 / Experiments] Method and Experiments: no specification of how query diversity is generated, how many agents are used, or how the aggregated document set is truncated, all of which affect both the parallelizability claim and the reproducibility of the reported gains.
minor comments (1)
- [Abstract] The phrase 'problem drift' is placed in quotes in the abstract but never defined, nor is it explained how the proposed method avoids this risk.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important gaps in the presentation of our results and implementation details. We will revise the manuscript to address each point, adding the necessary tables, specifications, and analyses to strengthen the empirical support and reproducibility of VOTE-RAG.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: comparative results on six benchmarks are asserted without any reported metrics, baselines, error bars, statistical significance tests, or per-dataset tables, so the performance claim cannot be evaluated and is load-bearing for the entire contribution.
Authors: We agree that the experimental claims require fuller documentation to be evaluable. In the revised manuscript we will insert a main results table (and appendix tables) reporting exact metrics for VOTE-RAG and every baseline on each of the six datasets, include standard-error bars from multiple runs, and add paired statistical significance tests (e.g., McNemar or t-tests) against the strongest baselines. revision: yes
-
Referee: [Section 3.2] Response Voting description (Section 3.2): the mechanism assumes the correct answer remains the mode even when retrieval contains misleading documents, yet no agreement rates, tie-resolution procedure, number of agents, or error-case analysis (e.g., instances where consistent hallucinations outvote the truth) are provided; this directly tests the core 'hallucination on hallucination' mitigation hypothesis.
Authors: We will expand Section 3.2 with the missing specifications: number of response agents (5), tie-resolution rule (highest average token-level confidence, else random among tied answers), and per-instance agreement rates. We will also add a dedicated error-analysis subsection that quantifies cases in which consistent hallucinations outvote the correct answer, thereby directly testing the core hypothesis. revision: yes
-
Referee: [Section 3 / Experiments] Method and Experiments: no specification of how query diversity is generated, how many agents are used, or how the aggregated document set is truncated, all of which affect both the parallelizability claim and the reproducibility of the reported gains.
Authors: We accept that these implementation details are required for reproducibility. The revision will specify: (i) query diversity generation via prompt paraphrasing and temperature sampling, (ii) exact agent counts (4 for retrieval voting, 5 for response voting), and (iii) truncation of the aggregated document pool to the top-10 unique passages after deduplication. These additions will also clarify the parallel execution schedule. revision: yes
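The tie-resolution rule the authors promise in the response above can be sketched as follows. The rule (majority vote, ties broken by highest average token-level confidence, then randomly) comes from the rebuttal; the data shapes and function signature here are illustrative assumptions.

```python
import random
from collections import Counter
from typing import Dict, List

def resolve_vote(answers: List[str],
                 confidences: Dict[str, float],
                 seed: int = 0) -> str:
    """Majority vote with the rebuttal's tie-resolution rule:
    ties go to the answer with the highest average token-level
    confidence, remaining ties are broken at random."""
    counts = Counter(answers)
    top = max(counts.values())
    tied = [a for a, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    best = max(confidences.get(a, 0.0) for a in tied)
    tied = [a for a in tied if confidences.get(a, 0.0) == best]
    return random.Random(seed).choice(tied)
```

With the rebuttal's stated five response agents, a 2-2-1 split is the typical case where this rule, rather than the raw vote, decides the output.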
Circularity Check
No circularity: VOTE-RAG is a direct procedural ensemble method with no derivation chain
Full rationale
The paper describes VOTE-RAG as a training-free, two-stage procedural framework consisting of parallel query generation for retrieval aggregation followed by independent response generation and majority voting. No equations, fitted parameters, ansatzes, or first-principles derivations are present. Performance claims rest on empirical benchmark comparisons rather than any 'prediction' that reduces to the method's own inputs by construction. No self-citations, uniqueness theorems, or load-bearing references to prior author work are invoked to justify core steps. The central mechanism (majority vote over LLM outputs) is presented as an explicit algorithmic choice, not derived from or equivalent to its own outputs. This is a standard non-circular empirical proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Majority vote among independent generations selects the non-hallucinated answer when retrieval is flawed.
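The force of this assumption can be made visible with a toy Condorcet-style calculation: if agents were truly independent, majority vote would amplify any per-agent accuracy above 50%, but it amplifies error just as readily below 50%. Real VOTE-RAG agents share one document pool, so the independence assumed here is exactly what flawed retrieval undermines.

```python
from math import comb

def p_majority_correct(n: int, p: float) -> float:
    """Probability that more than n/2 of n independent agents answer
    correctly, each with accuracy p (a toy model; shared retrieved
    context makes real agents correlated, not independent)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# With independence, 5 agents at 70% each yield a stronger majority
# (~0.837), but at 40% each the vote is right only ~0.317 of the time.
assert p_majority_correct(5, 0.7) > 0.7
assert p_majority_correct(5, 0.4) < 0.4
```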
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear — "VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear — "simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations"
Forward citations
Cited by 1 Pith paper
-
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
Reference graph
Works this paper leans on
- [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://arxiv.org/abs/2303.08774
- [2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023. [Online]. Available: https://arxiv.org/abs/2307.09288
- [3] A. Palikhe, Z. Yu, Z. Wang, and W. Zhang, "Towards transparent AI: A survey on explainable large language models," arXiv preprint arXiv:2506.21812, 2025.
- [4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023. [Online]. Available: https://dl.acm.org/doi/10.1145/3571730
- [5] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," ACM Transactions on Information Systems, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3703155
- [6] Z. Yu, M. Y. I. Idris, P. Wang, and R. Qureshi, "DINOv3-powered multi-task foundation model for quantitative remote sensing estimation," AAAI 2026, vol. 40, no. 48, pp. 41455–41456, 2026.
- [7] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, "Retrieval-augmented generation for large language models: A survey," arXiv preprint arXiv:2312.10997, 2023. [Online]. Available: https://arxiv.org/abs/2312.10997
- [8] W. Hu, W. Zhang, Y. Jiang, C. J. Zhang, X. Wei, and L. Qing, "Removal of hallucination on hallucination: Debate-augmented RAG," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 15839–15853. [Online]. Available: ht...
- [9] A. Sarkar, M. Y. I. Idris, and Z. Yu, "Reasoning in computer vision: Taxonomy, models, tasks, and methodologies," arXiv preprint arXiv:2508.10523, 2025.
- [10] Z. Xie, C. Wang, Y. Wang, S. Cai, S. Wang, and T. Jin, "Chat-driven text generation and interaction for person retrieval," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 5259–5270.
- [11] Z. Xie, "Conquer: Context-aware representation with query enhancement for text-based person search," arXiv preprint arXiv:2601.18625, 2026.
- [12] Z. Yu and C. S. Chan, "Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images," AAAI 2025, vol. 39, no. 9, pp. 9716–9724, 2025.
- [13] Z. Yu, J. Wang, H. Chen, and M. Y. I. Idris, "QRS-TRS: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing," IEEE Access, 2025.
- [14] Z. Xie, X. Liu, B. Zhang, Y. Lin, S. Cai, and T. Jin, "HVD: Human vision-driven video representation learning for text-video retrieval," arXiv preprint arXiv:2601.16155, 2026.
- [15] Z. Xie, B. Zhang, Y. Lin, and T. Jin, "Delving deeper: Hierarchical visual perception for robust video-text retrieval," arXiv preprint arXiv:2601.12768, 2026.
- [16] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, "Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 10 0...
- [17] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, "Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy," in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 9248–9274. [Onli...
- [18] Z. Feng, X. Feng, D. Zhao, M. Yang, and B. Qin, "Retrieval-generation synergy augmented large language models," in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11661–11665. [Online]. Available: https://ieeexplore.ieee.org/document/10448015
- [19] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, "Self-RAG: Learning to retrieve, generate, and critique through self-reflection," arXiv preprint arXiv:2310.11511, 2023. [Online]. Available: https://arxiv.org/abs/2310.11511
- [20] Z. Yu, M. Y. I. Idris, P. Wang, Y. Xia, and Y. Xiang, "ForgetMe: Benchmarking the selective forgetting capabilities of generative models," EAAI, vol. 161, p. 112087, 2025.
- [21] H. K. Choi, X. Zhu, and S. Li, "Debate or vote: Which yields better decisions in multi-agent large language models?" in Advances in Neural Information Processing Systems, 2025.
- [22] Z. Yu, H. Jiang, P. Wang, Z. Lin, and Y. Xiang, "Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion," ICASSP 2026, 2026.
- [23] Z. Yu, M. Y. I. Idris, P. Wang, and R. Qureshi, "CoTextor: Training-free modular multilingual text editing via layered disentanglement and depth-aware fusion," in NeurIPS 2025, 2025.
- [24] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, "Retrieval augmented language model pre-training," in International Conference on Machine Learning. PMLR, 2020, pp. 3929–3938. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3524938.3525306
- [25] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401
- [26] G. Izacard and E. Grave, "Leveraging passage retrieval with generative models for open domain question answering," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Online: Association for Computational Linguistics, Apr. 2021, pp. 874–8...
- [27] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, "Few-shot learning with retrieval augmented language models," arXiv preprint arXiv:2208.03299, 2022. [Online]. Available: https://arxiv.org/abs/2208.03299
- [28] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih, "REPLUG: Retrieval-augmented black-box language models," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard, Eds. Me...
- [29] X.-Y. Wei and C.-W. Ngo, "Fusing semantics, observability, reliability and diversity of concept detectors for video search," in Proceedings of the 16th ACM International Conference on Multimedia, 2008, pp. 81–90. [Online]. Available: https://dl.acm.org/doi/10.1145/1459359.1459371
- [30] J. Becker, "Multi-agent large language models for conversational task-solving," arXiv preprint arXiv:2410.22932, 2024. [Online]. Available: https://arxiv.org/abs/2410.22932
- [31] Z. Li, Z. Fu, Y. Hu, Z. Chen, H. Wen, and L. Nie, "FineCIR: Explicit parsing of fine-grained modification semantics for composed image retrieval," https://arxiv.org/abs/2503.21309, 2025.
- [32] Z. Li, Z. Chen, H. Wen, Z. Fu, Y. Hu, and W. Guan, "Encoder: Entity mining and modification relation binding for composed image retrieval," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 5, 2025, pp. 5101–5109.
- [33] Z. Chen, Y. Hu, Z. Li, Z. Fu, H. Wen, and W. Guan, "HUD: Hierarchical uncertainty-aware disambiguation network for composed video retrieval," in Proceedings of the ACM International Conference on Multimedia, 2025, pp. 6143–6152.
- [34] Z. Chen, Y. Hu, Z. Fu, Z. Li, J. Huang, Q. Huang, and Y. Wei, "Intent: Invariance and discrimination-aware noise mitigation for robust composed image retrieval," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 25, 2026, pp. 20463–20471.
- [35] Z. Chen, Y. Hu, Z. Li, Z. Fu, X. Song, and L. Nie, "Offset: Segmentation-based focus shift revision for composed image retrieval," in Proceedings of the ACM International Conference on Multimedia, 2025, pp. 6113–6122.
- [36] Y. Hu, Z. Li, Z. Chen, Q. Huang, Z. Fu, M. Xu, and L. Nie, "Refine: Composed video retrieval via shared and differential semantics enhancement," ACM Transactions on Multimedia Computing, Communications and Applications, 2026.
- [37] Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, "Active retrieval augmented generation," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 7969–7992. [Online]. Avail...
- [38] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, "Natural questions: A benchmark for question answering research," Transactions of the Association for Computational Linguistics, vol. 7, pp....
- [39] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M.-Y. Kan, Eds. Vancouver, Canada: Association for Computational Linguistics, Jul. 20...
- [40] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, "When not to trust language models: Investigating effectiveness of parametric and non-parametric memories," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canad...
- [41] X. Ho, A.-K. Duong Nguyen, S. Sugawara, and A. Aizawa, "Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps," in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 66...
- [42] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, "HotpotQA: A dataset for diverse, explainable multi-hop question answering," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Brussels, Belgium: Association for Computationa...
- [43] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant, "Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies," Transactions of the Association for Computational Linguistics, vol. 9, pp. 346–361, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.21/
- [44] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, "Improving factuality and reasoning in language models through multiagent debate," in Proceedings of the 41st International Conference on Machine Learning, ser. ICML'24. JMLR.org, 2024. [Online]. Available: https://dl.acm.org/doi/10.5555/3692070.3692537
- [45] J. Kim, J. Nam, S. Mo, J. Park, S.-W. Lee, M. Seo, J.-W. Ha, and J. Shin, "SuRe: Summarizing retrievals using answer candidates for open-domain QA of LLMs," arXiv preprint arXiv:2404.13081, 2024. [Online]. Available: https://arxiv.org/abs/2404.13081
- [46] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://arxiv.org/abs/2407.21783
- [47] J. Jin, Y. Zhu, X. Yang, C. Zhang, and Z. Dou, "FlashRAG: A modular toolkit for efficient retrieval-augmented generation research," CoRR, vol. abs/2405.13576, 2024. [Online]. Available: https://arxiv.org/abs/2405.13576
- [48] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei, "Text embeddings by weakly-supervised contrastive pre-training," arXiv preprint arXiv:2212.03533, 2022. [Online]. Available: https://arxiv.org/abs/2212.03533