Recognition: 2 theorem links
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
Pith reviewed 2026-05-14 19:48 UTC · model grok-4.3
The pith
F-GRPO lets one LLM jointly generate candidates and rank them by factorizing policy optimization into separate phases with distinct advantages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By factorizing the policy into candidate generation and ranking while sharing a single LLM backbone, and by applying separate group-relative advantages to each phase inside a two-phase sequence-level objective, the model can optimize both the selection of relevant candidates and their correct ordering against downstream utility signals in a single end-to-end rollout.
What carries the argument
Factorized Group-Relative Policy Optimization (F-GRPO), which decomposes the sequence-level objective into generation and ranking phases, supplies each with its own group-relative advantage, and combines an order-invariant coverage reward with a position-aware utility reward.
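The advantage structure described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the paper's implementation: each rollout in a group receives an order-invariant coverage reward for its generation phase and a position-aware utility reward for its ranking phase, and each phase's advantage is computed group-relatively (reward centered and scaled within the group, as in GRPO).

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage as in GRPO: center and scale rewards
    within a group of rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of 4 rollouts for one prompt. Reward values are invented.
coverage_rewards = [0.8, 0.5, 0.8, 0.2]  # generation phase: fraction of relevant items generated
utility_rewards  = [0.9, 0.4, 0.6, 0.1]  # ranking phase: e.g. NDCG of the final ranked list

# F-GRPO (as described): one advantage per phase, applied only to that
# phase's tokens; plain GRPO would use a single advantage for the whole rollout.
adv_gen  = group_relative_advantages(coverage_rewards)  # weights generation tokens
adv_rank = group_relative_advantages(utility_rewards)   # weights ranking tokens
```

A rollout that generates good candidates but ranks them poorly would get a positive `adv_gen` and a negative `adv_rank`, which is exactly the credit separation the single-reward GRPO baseline cannot express.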
If this is right
- Top-ranked performance rises over both standard GRPO and separately trained generation-then-ranking pipelines on sequential recommendation and multi-hop QA tasks.
- The method outperforms supervised fine-tuning baselines while staying competitive with strong zero-shot rerankers.
- No changes to model architecture or inference procedure are required after training.
- End-to-end optimization aligns generation and ranking directly with final utility rather than with intermediate retrieval metrics.
Where Pith is reading between the lines
- The same phase-factorization pattern could apply to other composite generative tasks such as multi-step planning followed by execution.
- Because the backbone remains unchanged, the approach may scale to larger models without doubling parameter count at deployment.
- If the two-phase objective proves stable, similar factorization might reduce the need for hand-crafted multi-stage pipelines in retrieval-augmented generation.
Load-bearing premise
The credit assignment problem between missing good candidates and mis-ordering them can be solved simply by giving each phase its own group-relative advantage while keeping the LLM backbone shared.
What would settle it
A controlled experiment in which the unified model is forced to generate the same high-coverage candidate set as a strong decoupled baseline yet still produces lower final utility after ranking, or the reverse case where ranking quality stays fixed but generation quality drops.
Figures
read the original abstract
Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
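The two reward types named in the abstract can be made concrete with a small sketch. The function names are hypothetical, and the DCG-style position weighting stands in for whatever discount the paper actually uses; the point is only the contrast between an order-invariant signal and a position-aware one.

```python
import math

def coverage_reward(generated, relevant):
    """Order-invariant: identical for any permutation of `generated`."""
    return len(set(generated) & set(relevant)) / max(len(relevant), 1)

def utility_reward(ranked, relevant):
    """Position-aware: DCG-style, rewards relevant items placed early."""
    gain = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), len(ranked))))
    return gain / ideal if ideal else 0.0

relevant = {"d1", "d2"}
# Same candidate set, different orders: coverage is identical, utility is not.
good_order = ["d1", "d2", "d9"]
bad_order  = ["d9", "d2", "d1"]
assert coverage_reward(good_order, relevant) == coverage_reward(bad_order, relevant)
assert utility_reward(good_order, relevant) > utility_reward(bad_order, relevant)
```

This is why sequence-level feedback alone cannot assign blame: a low final score is compatible with either a low coverage reward or a low utility reward.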
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces F-GRPO, a factorized extension of group-relative policy optimization that unifies candidate generation and ranking inside a single autoregressive LLM rollout. It factorizes the policy into generation and ranking components that share one backbone, optimizes them jointly with an order-invariant coverage reward and a position-aware utility reward, and resolves the resulting credit-assignment problem by applying separate group-relative advantages inside a two-phase sequence-level objective. Experiments on sequential recommendation and multi-hop QA benchmarks report that F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers without any inference-time architectural changes.
Significance. If the factorization and phase-specific advantages can be shown to remain unbiased under the shared autoregressive coupling, the method would offer a practical route to end-to-end optimization of generative ranking pipelines that currently rely on staged retrieval-plus-reranking. The absence of inference-time overhead and the reported gains over both GRPO and supervised baselines would make the framework relevant to recommendation and retrieval-augmented generation systems.
major comments (2)
- [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.
- [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.
minor comments (2)
- [Experiments] The abstract refers to “sequential recommendation and multi-hop question answering benchmarks” without naming the concrete datasets or reporting the number of runs and statistical significance tests; these details should be added to the experimental section.
- [Method] Notation for the generation-phase and ranking-phase advantages should be introduced with explicit equations rather than descriptive text, to make the two-phase objective reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We appreciate the recognition of F-GRPO's potential for end-to-end optimization of generative ranking pipelines without inference-time overhead. We address each major comment below and will revise the manuscript to strengthen the technical presentation and experimental validation.
read point-by-point responses
-
Referee: [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.
Authors: We acknowledge that the manuscript would benefit from an explicit derivation showing that the phase-specific group-relative advantages remain unbiased under the shared autoregressive backbone. In the revised version we will add a dedicated subsection deriving the policy gradient for the two-phase objective. The derivation will demonstrate that, by computing separate advantages over masked phase-specific tokens and applying the group-relative baseline within each phase, gradients from the position-aware utility reward are isolated to the ranking tokens and do not propagate back to generation decisions. revision: yes
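The masking mechanism the response appeals to can be sketched as follows. This is a minimal illustration under the assumption that phase boundaries are marked by per-token masks; whether this suffices for unbiasedness is exactly what the referee asks to see derived.

```python
import numpy as np

# Per-token REINFORCE weights for one rollout of 8 tokens:
# tokens 0-4 form the generation phase, tokens 5-7 the ranking phase.
gen_mask  = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=float)
rank_mask = 1.0 - gen_mask

adv_gen, adv_rank = 0.7, -0.3  # hypothetical phase-specific advantages

# Each token's log-prob gradient is weighted only by its own phase's
# advantage, so the utility advantage never weights generation tokens directly.
token_weights = adv_gen * gen_mask + adv_rank * rank_mask
# Note: gradients from both phases still flow into the shared backbone's
# parameters, which is the coupling the referee wants analyzed.
```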
-
Referee: [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.
Authors: We agree that an ablation isolating the factorization and advantage separation is necessary. We will add this experiment in the revised manuscript: a non-factorized GRPO baseline that uses identical two-phase formatting and the same order-invariant coverage plus position-aware utility rewards, but applies a single group-relative advantage over the entire sequence. Performance differences versus F-GRPO will be reported on both the sequential recommendation and multi-hop QA benchmarks to attribute gains specifically to the phase-specific advantages. revision: yes
Circularity Check
No significant circularity; F-GRPO defined as new factorized objective from credit-assignment setup
full rationale
The paper introduces F-GRPO as an explicit new optimization framework that factorizes a single autoregressive policy into generation and ranking phases, using separate group-relative advantages inside a two-phase sequence-level objective along with order-invariant coverage and position-aware utility rewards. This construction is presented directly from the stated credit-assignment gap in the abstract, without any quoted equations or self-citations that reduce the claimed separation or performance gains back to fitted inputs or prior results by definition. The central claim therefore remains an independent modeling choice rather than a renaming or self-referential fit.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective... L(θ) = L_slate + λ L_rank + β_KL D_KL
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Factorized credit assignment... phase-specific gradient weighting
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Ellis, Brian Whitman, and Paul Lamere
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval ( ISMIR 2011) , 2011
work page 2011
-
[2]
Autoregressive search engines: Generating substrings as document identifiers
Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems, 35: 0 31668--31683, 2022
work page 2022
-
[3]
Generative slate recommendation with reinforcement learning
Romain Deffayet, Thibaut Thonet, Jean-Michel Renders, and Maarten de Rijke. Generative slate recommendation with reinforcement learning. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM '23, pp.\ 580–588, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394079. doi:10.1145/3539597.35...
-
[4]
Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachim
Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachim. Policy-gradient training of language models for ranking, 2024. URL https://arxiv.org/abs/2310.04407
-
[5]
R e2 G : Retrieve, rerank, generate
Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. R e2 G : Retrieve, rerank, generate. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...
-
[6]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025
work page 2025
-
[7]
Towards two-stage counterfactual learning to rank
Shashank Gupta, Yiming Liao, and Maarten de Rijke. Towards two-stage counterfactual learning to rank. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), ICTIR '25, pp.\ 177–182, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400718618. doi:10.1145/3731...
-
[8]
F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5 0 (4), December 2015. ISSN 2160-6455. doi:10.1145/2827872. URL https://doi.org/10.1145/2827872
-
[9]
Session-based Recommendations with Recurrent Neural Networks
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks, 2016. URL https://arxiv.org/abs/1511.06939
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[10]
Towards universal sequence representation learning for recommender systems
Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, pp.\ 585–593, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393850. doi:10...
-
[11]
Large language models are zero-shot rankers for recommender systems
Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II, pp.\ 364–381, Berlin, Heidelberg, 202...
-
[12]
Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 a . URL https://openreview.net/forum?id=NFM8F5cV0V
work page 2025
-
[13]
Interactive visualization recommendation with hier-sucb
Songwen Hu, Ryan A Rossi, Tong Yu, Junda Wu, Handong Zhao, Sungchul Kim, and Shuai Li. Interactive visualization recommendation with hier-sucb. In Proceedings of the ACM on Web Conference 2025, pp.\ 313--321, 2025 b
work page 2025
-
[14]
Chengkai Huang, Hongtao Huang, Tong Yu, Kaige Xie, Junda Wu, Shuai Zhang, Julian Mcauley, Dietmar Jannach, and Lina Yao. A survey of foundation model-powered recommender systems: From feature-based, generative to agentic paradigms. arXiv preprint arXiv:2504.16420, 2025 a
-
[15]
Towards agentic recommender systems in the era of multimodal large language models
Chengkai Huang, Junda Wu, Yu Xia, Zixu Yu, Ruhan Wang, Tong Yu, Ruiyi Zhang, Ryan A Rossi, Branislav Kveton, Dongruo Zhou, et al. Towards agentic recommender systems in the era of multimodal large language models. arXiv preprint arXiv:2503.16734, 2025 b
-
[16]
Pluralistic off-policy evaluation and alignment
Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, and Lina Yao. Pluralistic off-policy evaluation and alignment. arXiv preprint arXiv:2509.19333, 2025 c
-
[17]
Listwise preference diffusion optimization for user behavior trajectories prediction
Hongtao Huang, Chengkai Huang, Junda Wu, Tong Yu, Julian McAuley, and Lina Yao. Listwise preference diffusion optimization for user behavior trajectories prediction. Advances in Neural Information Processing Systems, 38: 0 159383--159408, 2026 a
work page 2026
-
[18]
Image difference captioning via adversarial preference optimization
Zihan Huang, Junda Wu, Rohan Surana, Tong Yu, David Arbour, Ritwik Sinha, and Julian McAuley. Image difference captioning via adversarial preference optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 33746--33758, 2025 d
work page 2025
-
[19]
Evaluation on entity matching in recommender systems
Zihan Huang, Rohan Surana, Zhouhang Xie, Junda Wu, Yu Xia, and Julian McAuley. Evaluation on entity matching in recommender systems. arXiv preprint arXiv:2601.17218, 2026 b
-
[20]
Active learning for direct preference optimization
Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025
-
[21]
A personalized conversational benchmark: Towards simulating personalized conversations
Li Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, et al. A personalized conversational benchmark: Towards simulating personalized conversations. arXiv preprint arXiv:2505.14106, 2025 a
-
[22]
Importance sampling for multi-negative multimodal direct preference optimization
Xintong Li, Chuhan Wang, Junda Wu, Rohan Surana, Tong Yu, Julian McAuley, and Jingbo Shang. Importance sampling for multi-negative multimodal direct preference optimization. arXiv preprint arXiv:2509.25717, 2025 b
-
[23]
Ract: Ranking-aware chain-of-thought optimization for llms
Haowei Liu, Xuyang Wu, Guohao Sun, Hsin-Tai Wu, Zhiqiang Tao, and Yi Fang. Ract: Ranking-aware chain-of-thought optimization for llms. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP 2025, pp.\ 178–188, New York, NY, USA, 2025 a . Association for...
-
[24]
Learning to rank for information retrieval
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval , 3 0 (3): 0 225--331, 2009
work page 2009
-
[25]
Understanding r1-zero-like training: A critical perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025 b . URL https://openreview.net/forum?id=5PAF7PAY2Y
work page 2025
-
[26]
Recranker: Instruction tuning large language model as ranker for top-k recommendation
Sichun Luo, Bowei He, Haohan Zhao, Wei Shao, Yanlin Qi, Yinya Huang, Aojun Zhou, Yuxuan Yao, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. Recranker: Instruction tuning large language model as ranker for top-k recommendation. ACM Trans. Inf. Syst., 43 0 (5), July 2025. ISSN 1046-8188. doi:10.1145/3705728. URL https://doi.org/10.1145/3705728
-
[27]
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention, 2025. URL https://arxiv.org/abs/2506.13585
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning
Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, et al. Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning. arXiv preprint arXiv:2602.17025, 2026
-
[29]
Large language models for conversational user simulation: A comprehensive survey
Bo Ni, Leyao Wang, Yu Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Leura, Samyadeep Basu, Subhojyoti Mukherjee, et al. Large language models for conversational user simulation: A comprehensive survey. 2025
work page 2025
-
[30]
A survey on llm-based conversational user simulation
Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 4266--4301, 2026
work page 2026
-
[31]
Document ranking with a pretrained sequence-to-sequence model
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. Document ranking with a pretrained sequence-to-sequence model. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp.\ 708--718, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-em...
-
[32]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022
work page 2022
-
[33]
Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Hongyong Yu, Chengxiang Zhuo, and Zang Li. Higr: Efficient generative slate recommendation via hierarchical planning and multi-objective preference alignment, 2026. URL https://arxiv.org/abs/2512.24787
-
[34]
Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, 2021. URL https://arxiv.org/abs/2101.05667
-
[35]
Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze!, 2023. URL https://arxiv.org/abs/2312.02724
-
[36]
Qwen3.5 : Towards native multimodal agents, February 2026
Qwen Team . Qwen3.5 : Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[37]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc
work page 2023
-
[38]
The probabilistic relevance framework: Bm25 and beyond
Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3 0 (4): 0 333–389, April 2009. ISSN 1554-0669. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019
-
[39]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[40]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Rankllm: A python package for reranking with llms
Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, Andrew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, and Jimmy Lin. Rankllm: A python package for reranking with llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '25, pp.\ 3681–3690, New York, NY, USA, ...
-
[42]
Is C hat GPT good at search? investigating large language models as re-ranking agents
Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is C hat GPT good at search? investigating large language models as re-ranking agents. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 14918--14937, Si...
-
[43]
Rohan Surana, Junda Wu, Zhouhang Xie, Yu Xia, Harald Steck, Dawen Liang, Nathan Kallus, and Julian McAuley. From reviews to dialogues: Active synthesis for zero-shot llm-based conversational recommender system. arXiv preprint arXiv:2504.15476, 2025
-
[44]
Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, and Julian McAuley. Generate, filter, control, replay: A comprehensive survey of rollout s...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[45]
Maximum likelihood reinforcement learning, 2026
Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning, 2026. URL https://arxiv.org/abs/2602.02710
-
[46]
Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models, 2023. URL https://arxiv.org/abs/2312.16098
-
[47]
Listwise generative retrieval models via a sequential learning process
Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, and Xueqi Cheng. Listwise generative retrieval models via a sequential learning process. ACM Trans. Inf. Syst., 42 0 (5), April 2024. ISSN 1046-8188. doi:10.1145/3653712. URL https://doi.org/10.1145/3653712
-
[48]
Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
M u S i Q ue: Multihop questions via single-hop question composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. M u S i Q ue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10: 0 539--554, 2022. doi:10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/
-
[50]
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\...
-
[51]
Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes
Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang, Lina Yao, Julian McAuley, and Jingbo Shang. Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes. arXiv preprint arXiv:2601.05600, 2026
-
[52]
arXiv preprint arXiv:2304.03153 , year=
Lei Wang and Ee-Peng Lim. Zero-shot next-item recommendation using large pretrained language models, 2023. URL https://arxiv.org/abs/2304.03153
-
[53]
A neural corpus indexer for document retrieval
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems, 35: 0 25600--25614, 2022
work page 2022
-
[54]
Ctrls: Chain-of-thought reasoning via latent state-transition
Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley. Ctrls: Chain-of-thought reasoning via latent state-transition. In The 29th International Conference on Artificial Intelligence and Statistics
-
[55]
Deconfounded and explainable interactive vision-language retrieval of complex scenes
Junda Wu, Tong Yu, and Shuai Li. Deconfounded and explainable interactive vision-language retrieval of complex scenes. MM '21, pp.\ 2103–2111, New York, NY, USA, 2021 a . Association for Computing Machinery. ISBN 9781450386517. doi:10.1145/3474085.3475366. URL https://doi.org/10.1145/3474085.3475366
-
[56]
Clustering of conversational bandits for user preference learning and elicitation
Junda Wu, Canzhe Zhao, Tong Yu, Jingyang Li, and Shuai Li. Clustering of conversational bandits for user preference learning and elicitation. CIKM '21, pp.\ 2129–2139, New York, NY, USA, 2021 b . Association for Computing Machinery. ISBN 9781450384469. doi:10.1145/3459637.3482328. URL https://doi.org/10.1145/3459637.3482328
-
[57]
Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation
Junda Wu, Zhihui Xie, Tong Yu, Handong Zhao, Ruiyi Zhang, and Shuai Li. Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pp.\ 290–300, New York, NY, USA, 2022. Association for Com...
-
[58]
Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation
Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng Hou, and Julian McAuley. Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp.\ 3391--3401, 2024 a
work page 2024
-
[59]
Personalized multimodal large language models: A survey
Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A Rossi, Franck Dernoncourt, et al. Personalized multimodal large language models: A survey. arXiv preprint arXiv:2412.02142, 2024 b
-
[60] Junda Wu, Tong Yu, Xiang Chen, Haoliang Wang, Ryan Rossi, Sungchul Kim, Anup Rao, and Julian McAuley. Decot: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14073–14087, 2024c
-
[61] Junda Wu, Warren Li, Zachary Novack, Amit Namburi, Carol Chen, and Julian McAuley. Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025a
-
[62] Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, et al. Ocean: Offline chain-of-thought evaluation and alignment in large language models. In International Conference on Learning Representations, volume 2025, pp. 100570–100589, 2025b
-
[63] Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, and Julian McAuley. In-context ranking preference optimization. In Second Conference on Language Modeling, 2025c. URL https://openreview.net/forum?id=L2NPhLAKEd
-
[64] Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V Maharaj, Ruiyi Zhang, Victor Bursztyn, Sungchul Kim, Ryan A Rossi, et al. Doc-react: Multi-page heterogeneous document question-answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 67–78, 2025d
-
[65] Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, and Julian McAuley. Sand: Boosting llm agents with self-taught action deliberation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3062–3077, 2025a
-
[66] Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A Rossi, Haoliang Wang, and Julian McAuley. Knowledge-aware query expansion with large language models for textual and relational retrieval. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
-
[67] Zhouhang Xie, Junda Wu, Yiran Shen, Raghav Jain, Yu Xia, Xintong Li, Aaron Chang, Ryan A Rossi, Tong Yu, Sachin Kumar, et al. A survey on personalized and pluralistic preference alignment in large language models. In Second Conference on Language Modeling
-
[68] Zhouhang Xie, Junda Wu, Hyunsik Jeon, Zhankui He, Harald Steck, Rahul Jha, Dawen Liang, Nathan Kallus, and Julian McAuley. Neighborhood-based collaborative filtering for conversational recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems, pp. 1045–1050, 2024
-
[69] An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. In First Conference on Language Modeling
-
[70] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing...
-
[71] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ... DAPO: An open-source LLM reinforcement learning system at scale, 2025
-
[72] Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, and Julian McAuley. Explainable chain-of-thought reasoning: An empirical analysis on state-aware reasoning dynamics. arXiv preprint arXiv:2509.00190, 2025b
-
[73] Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. Llamarec: Two-stage recommendation using large language models for ranking, 2023. URL https://arxiv.org/abs/2311.02089
-
[74] Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training, 2025. URL https://arxiv.org/abs/2504.19599
-
[75] Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Vito Ostuni, Jundong Li, and Nathan Kallus. Rank-grpo: Training llm-based conversational recommender systems with reinforcement learning, 2026. URL https://arxiv.org/abs/2510.20150
-
[76] Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Rank-r1: Enhancing reasoning in llm-based document rerankers via reinforcement learning, 2025. URL https://arxiv.org/abs/2503.06034