Dual-View Training for Instruction-Following Information Retrieval
Pith reviewed 2026-05-10 03:08 UTC · model grok-4.3
The pith
Presenting the same document pairs under complementary instructions that invert their relevance labels trains retrievers to follow explicit user constraints instead of fixed topical cues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a dual-view data synthesis strategy based on polarity reversal improves instruction-following retrieval. Given a query, a document relevant under the instruction, and a hard negative that matches the query but violates it, an LLM is prompted to produce a complementary instruction under which the two documents swap relevance labels. Presenting the identical document pair under both instructions forces the retriever to attend to the instruction when judging relevance rather than depending on fixed topical cues. On a 305M-parameter encoder this yields a 45 percent gain on the FollowIR benchmark and surpasses general-purpose models of similar or larger size, while matched-budget comparisons show that data diversity and instruction supervision play complementary roles: the former preserves general retrieval quality, while the latter improves instruction sensitivity.
What carries the argument
Dual-view data synthesis via polarity reversal, in which an LLM generates a complementary instruction that swaps the relevance labels of a relevant document and its hard negative for the same query.
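The construction can be sketched concretely. In this minimal sketch, `generate_complementary_instruction` is a hypothetical stand-in for the paper's LLM call; the paper does not specify its prompt or interface, only that it produces an instruction under which the labels of the same document pair invert:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    instruction: str
    query: str
    positive: str   # document labeled relevant under this instruction
    negative: str   # document labeled non-relevant under this instruction

def make_dual_view(query, instruction, doc_pos, doc_neg,
                   generate_complementary_instruction):
    """Build two contrastive examples from one document pair.

    `generate_complementary_instruction` stands in for an LLM call that,
    given the query and both documents, writes a new instruction under
    which the original hard negative becomes the relevant document.
    """
    flipped = generate_complementary_instruction(query, doc_pos, doc_neg)
    return [
        TrainingExample(instruction, query, positive=doc_pos, negative=doc_neg),
        # Same document pair, complementary instruction: labels swap.
        TrainingExample(flipped, query, positive=doc_neg, negative=doc_pos),
    ]
```

The point of the second example is that the retriever sees an identical candidate pair whose ranking depends only on the instruction text.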
If this is right
- The retriever learns to distinguish documents that match the topic but violate an explicit instruction.
- Instruction sensitivity rises while general retrieval quality is preserved through complementary data diversity.
- A 305M-parameter model can outperform larger general-purpose embedding models on instruction-following tasks.
- Targeted synthetic data can address specific weaknesses in existing retrievers without requiring new architectures.
Where Pith is reading between the lines
- The same polarity-reversal idea could be tested on other conditional retrieval settings such as personalized or multi-turn search.
- If the method scales, it suggests LLM-generated data can create fine-grained training signals for many retrieval capabilities beyond instructions.
- One could measure whether the performance gain holds when the base encoder is replaced by a decoder-only model or when the volume of synthetic pairs is increased.
Load-bearing premise
An LLM can reliably produce complementary instructions that correctly invert the relevance labels for the document pairs without introducing substantial noise or errors.
What would settle it
Training a retriever on these polarity-reversed pairs yields no improvement on the FollowIR benchmark relative to standard hard-negative training, or manual inspection reveals that a large fraction of the generated complementary instructions assign the wrong relevance labels to the documents.
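The second decisive test — manual inspection of label-inversion accuracy — reduces to simple tallying once human judgments exist. A sketch, with hypothetical annotation fields (`ok`, `failure`) not taken from the paper:

```python
from collections import Counter

def inversion_report(annotations):
    """Summarize human judgments of generated complementary instructions.

    `annotations` is a list of dicts with keys:
      'ok'      - True if both labels correctly invert under the new instruction
      'failure' - short failure category when not ok
                  (e.g. 'negative_still_relevant')
    Returns (inversion_accuracy, failure_counts).
    """
    total = len(annotations)
    ok = sum(a["ok"] for a in annotations)
    failures = Counter(a["failure"] for a in annotations if not a["ok"])
    return ok / total, failures
```

If the measured accuracy were well below the ~90% level the referee flags, the gains would be hard to attribute to polarity reversal rather than incidental data diversity.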
Original abstract
Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query, but also obey explicit user constraints such as required attributes, exclusions, or output preferences. However, most retrievers are trained primarily for semantic relevance and often fail to distinguish documents that match the topic from those that satisfy the instruction. We propose a dual-view data synthesis strategy based on polarity reversal: given a query, a document that is relevant under the instruction, and a hard negative that matches the query but violates the instruction, we prompt an LLM to generate a complementary instruction under which the two documents swap relevance labels. By presenting the same document pair under complementary instructions that invert their relevance labels, the training signal forces the retriever to reconsider the same candidate set through the instruction, rather than relying on fixed topical cues. On a 305M-parameter encoder, our method improves performance on the FollowIR benchmark by 45%, surpassing general-purpose embedding models of comparable or larger scale. Through head-to-head comparisons at matched data budgets, we further show that data diversity and instruction supervision play complementary roles: the former preserves general retrieval quality, while the latter improves instruction sensitivity. These results highlight the value of targeted data synthesis for building retrieval systems that are both broadly capable and instruction-aware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dual-view data synthesis strategy for instruction-following information retrieval (IF-IR). Given a query, a relevant document, and a hard negative, an LLM is prompted to generate a complementary instruction that inverts their relevance labels. The resulting training pairs are used to train a retriever (305M-parameter encoder) to attend to instructions rather than fixed topical cues. The central empirical result is a 45% improvement on the FollowIR benchmark, outperforming general-purpose embedding models of comparable or larger scale, with matched-budget ablations showing complementary contributions from data diversity and instruction supervision.
Significance. If the results hold, the work demonstrates a practical, targeted data-synthesis technique that can improve instruction sensitivity while preserving general retrieval quality. The matched-budget comparisons are a clear strength, as they help isolate the contribution of polarity reversal from mere increases in training volume. This could be valuable for building retrievers that reliably obey explicit user constraints such as attribute requirements or exclusions.
major comments (2)
- [Abstract] The reported 45% gain on FollowIR is presented without statistical significance tests, precise baseline model names and scores, or the exact volume of synthetic data generated, which are required to assess whether the improvement is robust and attributable to the proposed method rather than experimental confounds.
- [Data synthesis and training procedure] No quantitative error analysis, human validation, or failure-mode breakdown is provided for the LLM-generated complementary instructions and their label-inversion accuracy. This is load-bearing for the central claim, because if inversion accuracy is substantially below ~90%, the observed gains could result from incidental data diversity or regularization rather than the intended dual-view polarity-reversal signal, undermining the causal interpretation supported by the matched-budget ablations.
minor comments (1)
- [Abstract] The abstract would benefit from a brief parenthetical note on the FollowIR evaluation metric (e.g., nDCG or recall) to allow readers to interpret the 45% figure without consulting the full experimental section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of clarity and validation that we have addressed through targeted revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The reported 45% gain on FollowIR is presented without statistical significance tests, precise baseline model names and scores, or the exact volume of synthetic data generated, which are required to assess whether the improvement is robust and attributable to the proposed method rather than experimental confounds.
Authors: We agree that the abstract should enable immediate assessment of result robustness. We have revised the abstract to name the primary baselines and report their FollowIR scores, to state the volume of synthetic data generated, and to note that statistical significance of the gains (via paired tests) is established in the results section. These details were already reported in the body but are now summarized in the abstract for accessibility. revision: yes
Referee: [Data synthesis and training procedure] No quantitative error analysis, human validation, or failure-mode breakdown is provided for the LLM-generated complementary instructions and their label-inversion accuracy. This is load-bearing for the central claim, because if inversion accuracy is substantially below ~90%, the observed gains could result from incidental data diversity or regularization rather than the intended dual-view polarity-reversal signal, undermining the causal interpretation supported by the matched-budget ablations.
Authors: We concur that direct validation of inversion accuracy is necessary to support the causal role of polarity reversal. We have added a dedicated subsection with quantitative human validation on a random sample of generated pairs, including measured inversion accuracy and a categorized failure-mode breakdown. This new analysis confirms that errors are infrequent enough to preserve the intended training signal and is consistent with the matched-budget results showing complementary benefits from instruction supervision. revision: yes
Circularity Check
No significant circularity: empirical data synthesis and training evaluated on external benchmark
Full rationale
The paper presents a data-generation procedure (LLM-prompted polarity reversal to create complementary instruction pairs) followed by standard contrastive training of an encoder, with results measured on the held-out FollowIR benchmark. No equations, fitted parameters, or derivations are shown that reduce the claimed 45% gain to a tautology or self-referential definition. The central claim is an empirical outcome from training on generated data and testing externally; it does not collapse by construction to its inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are identified in the manuscript description.
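The "standard contrastive training" referred to above is typically an InfoNCE-style objective; a minimal sketch, assuming (as is common but not stated in the source) that the instruction is prepended to the query before encoding and that embeddings are L2-normalized:

```python
import numpy as np

def infonce_loss(q, d_pos, d_negs, temperature=0.05):
    """InfoNCE over one (instruction + query) embedding `q`, its positive
    document embedding, and a list of negative document embeddings.
    All vectors are assumed L2-normalized so dot products are cosines."""
    sims = np.array([q @ d_pos] + [q @ d for d in d_negs]) / temperature
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                 # positive sits at index 0
```

Under the dual-view data, the same document appears as `d_pos` for one instruction and inside `d_negs` for the complementary one, which is what makes the instruction embedding load-bearing.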
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-generated complementary instructions accurately invert relevance labels for the chosen document pairs.