{"total":16,"items":[{"citing_arxiv_id":"2606.01070","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Training for Zero-Resource Dense Retrieval Reranking","primary_cat":"cs.IR","submitted_at":"2026-05-31T07:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26352","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents","primary_cat":"cs.CL","submitted_at":"2026-05-25T21:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RICE-PO is a policy optimization framework that converts retrieval interactions into credit signals for latent reasoning steps in agents by selecting high-uncertainty actions as anchors and propagating credit based on influence strength and residual stability, outperforming baselines on BRIGHT and B","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22478","ref_index":60,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval","primary_cat":"cs.CV","submitted_at":"2026-05-21T13:36:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes PDF, a hierarchical multi-agent Perception-to-Deliberation Framework that adds experience self-evolution and test-time scaling to 
composed image retrieval, claiming SOTA on CIRR, CIRCO, and FashionIQ.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20683","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Layer-wise Token Compression for Efficient Document Reranking","primary_cat":"cs.IR","submitted_at":"2026-05-20T03:52:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13137","ref_index":24,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving","primary_cat":"cs.IR","submitted_at":"2026-05-13T08:04:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04018","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems","primary_cat":"cs.CL","submitted_at":"2026-05-05T17:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BRIGHT-Pro and RTriever-Synth advance 
reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01399","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-02T11:43:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00400","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FollowTable: A Benchmark for Instruction-Following Table Retrieval","primary_cat":"cs.IR","submitted_at":"2026-05-01T04:42:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"E5-Mistral (7B) [44] 76.2 58.6 2.0 6.2 87.2 69.3 1.9 9.0 87.4 68.0 2.2 14.1 74.3 57.1 -0.6 17.1 81.3 63.3 1.4 11.6 GritLM (7B) [26] 76.3 58.0 2.9 6.7 86.0 66.7 5.8 10.7 85.5 69.5 1.8 14.7 77.3 60.3 2.1 19.3 81.3 63.6 3.1 12.9 OpenAI v3 Large 75.7 54.6 2.4 4.3 84.1 60.2 3.0 7.8 87.0 66.2 0.4 11.7 74.4 52.9 -0.3 13.7 80.3 58.5 1.4 9.4 Qwen3-Emb (0.6B) [54] 75.4 59.6 4.2 9.6 80.4 
65.7 5.4 15.5 84.2 65.9 4.6 16.4 72.0 60.1 0.3 21.1 78.0 62.8 3.6 15.6 Qwen3-Emb (8B) [54] 77.1 66.8 7.5 14.5 88.3 75.4 7.1 17.6 88.8 77.6 14.2 24.7 76.2 70.7 11.5 31.8 82.6 72.6 10.1 22.2 Promptriever (7B) [47] 75.1 69.1 8.2 18.3 82.3 76.4 10.6 23.7 83.7 76.5 15.1 28.7 69.9 71.2 8.0 34.2 77.7 73.3 10.5 26.2 Table 5: Performance evaluation of re-ranking models on"},{"citing_arxiv_id":"2605.00063","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Reasoning-Intensive Retrieval: Progress and Challenges","primary_cat":"cs.IR","submitted_at":"2026-04-30T08:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27577","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reproducing Adaptive Reranking for Reasoning-Intensive IR","primary_cat":"cs.IR","submitted_at":"2026-04-30T08:30:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Reproducing GAR on BRIGHT shows it boosts reasoning-intensive retrieval effectiveness with low overhead when the reranker's signal quality is strong.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Thinking free ranker TFRank [9] further enhances the training paradigm by proposing an efficient reasoning-based ranker leveraging reasoning models. Their training recipe entails a multi-task learning strategy including fine-grained relevance scores for pointwise, listwise, and pairwise paradigms distilled from a Deepseek-R1 reasoning model. 
REARank [47] extends reasoning language models to listwise ranking, with explicit reasoning that is incentivized using reinforcement learning, under data-scarce scenarios with a data augmentation strategy, outperforming larger scale LLMs on reasoning-intensive document ranking tasks. Rank-R1 [49] trains an LLM as a setwise reranker using RL. It simplifies the task of finding the most"},{"citing_arxiv_id":"2605.18767","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking","primary_cat":"cs.IR","submitted_at":"2026-04-13T08:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11025","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images","primary_cat":"cs.CV","submitted_at":"2026-04-13T05:49:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, et al. 2026. Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception. 
arXiv preprint arXiv:2602.11858 (2026). [34] Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Benjamin Van Durme. 2025. Rank1: Test-time compute for reranking in information retrieval. arXiv preprint arXiv:2502.18418 (2025). [35] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965 (2025). [36] Penghao Wu and Saining Xie. 2024. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision"},{"citing_arxiv_id":"2604.07220","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval","primary_cat":"cs.IR","submitted_at":"2026-04-08T15:41:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and +14.1 over the best multimodal baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"or iteratively expanded queries [17, 21] to improve the retrieval of reasoning-intensive content. Re-Invoke [8] applies LLMs to unsupervised tool retrieval by enriching tool documents offline and extracting user intent at inference time. RankGPT [32] demonstrated that LLMs can directly rerank retrieved passages through sliding window prompting, while Rank1 [37] and RankR1 [43] further improved reranking via reasoning optimized LLMs. 2 Figure 1. 
An example where standard multimodal retrievers fail to identify the relevant document because the query image (a circuit diagram) contains critical visual cues that text-only or embedding-based matching cannot capture. HIVE generates a compensatory query that explicitly articulates these visual gaps, enabling successful retrieval."},{"citing_arxiv_id":"2604.07079","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL","primary_cat":"cs.IR","submitted_at":"2026-04-08T13:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rating query intent before retrieval consistently outperforms embedding-only approaches. 2.5. LLM-Based Reranking Reranking retrieved candidates using LLMs has emerged as a powerful strategy for improving retrieval precision. RankGPT [31] demonstrated that LLMs can directly rerank retrieved passages through sliding window prompting, significantly outperforming embedding-based ranking. Rank1 [34] and RankR1 [39] further improved reranking via reasoning-optimized LLMs trained with reinforcement learning to produce step-by-step relevance judgments. These approaches establish that reasoning-based reranking is consistently superior to similarity-based ranking for complex queries. 3. Method 3.1. 
Problem Formulation We address the multimodal-to-text retrieval task."},{"citing_arxiv_id":"2603.08819","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage","primary_cat":"cs.IR","submitted_at":"2026-03-09T18:20:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.03487","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking","primary_cat":"cs.IR","submitted_at":"2025-06-04T02:00:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProRank uses RL-based prompt warmup and fine-grained scoring to train small language models that surpass LLM rerankers on BEIR.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}