Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3
The pith
CoARS uses self-distilled RL to let recommender and user agents co-evolve by turning their interaction trajectories into internal training signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoARS is a self-distilled reinforcement learning framework built on two mechanisms: an interaction reward that derives coupled task-level supervision for the recommender agent and the user agent from the same trajectory, and self-distilled credit assignment that converts historical trajectories into token-level credit signals under teacher-student conditioning. Together, these let the two agents co-evolve while internalizing experience directly into their parameters.
What carries the argument
Interaction reward for coupled supervision plus self-distilled credit assignment for token-level signals in a co-evolving multi-agent RL loop.
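The two-signals-from-one-trajectory idea can be made concrete with a toy sketch. Everything below is an illustrative assumption, not the paper's actual reward definition (which this summary does not give): the `Turn` structure, the reciprocal-rank recommender reward, the 0.9 turn discount, and the judge-scored `feedback_consistency` list are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    recommended: list[str]  # items the recommender agent proposed this turn
    feedback: str           # the user agent's natural-language reaction

def interaction_rewards(turns, target_item, feedback_consistency):
    """Derive coupled task-level rewards for BOTH agents from one trajectory.

    Hypothetical decomposition: the recommender agent is scored by how early
    and how highly it surfaces the held-out target item; the user agent is
    scored by how consistent an external judge found its feedback with the
    ground-truth preference profile. Both come from the same trajectory.
    """
    rec_reward = 0.0
    for t, turn in enumerate(turns):
        if target_item in turn.recommended:
            rank = turn.recommended.index(target_item) + 1
            rec_reward = (1.0 / rank) * (0.9 ** t)  # earlier turns score higher
            break
    user_reward = sum(feedback_consistency) / len(feedback_consistency)
    return rec_reward, user_reward
```

Under these assumptions, a trajectory where the target item appears at rank 2 in the second turn gives the recommender 0.5 × 0.9 = 0.45, while the user agent's reward is simply its mean judged consistency.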
If this is right
- CoARS achieves higher recommendation performance than representative ARS baselines across multiple datasets.
- The method improves user alignment by allowing agents to refine preferences through ongoing interaction.
- Agents acquire recommendation-specific decision-making ability through parameter updates rather than external memory retrieval.
- Dense supervision from entire multi-turn trajectories is utilized instead of final outcomes alone.
- The interactive nature of recommender and user agents is directly captured to produce mutual endogenous signals.
Where Pith is reading between the lines
- The same interaction-reward and self-distillation pattern could be applied to other multi-agent LLM systems where agents can supervise each other without human labels.
- If the endogenous signals prove stable, training costs for conversational recommenders might drop by reducing dependence on separate RLHF stages.
- Testing the framework on longer interaction horizons or with real-time user feedback loops would reveal whether the co-evolution remains stable at scale.
Load-bearing premise
That interaction trajectories between recommender and user agents naturally generate reliable endogenous supervision signals sufficient for stable RL training without external labels or human feedback.
What would settle it
An ablation study in which removing the interaction reward or the self-distillation component causes performance to fall back to the level of standard Reflexion-style ARS baselines, or in which training diverges without added external rewards.
Original abstract
Large language model-empowered agentic recommender systems (ARS) reformulate recommendation as a multi-turn interaction between a recommender agent and a user agent, enabling iterative preference elicitation and refinement beyond conventional one-shot prediction. However, existing ARS are mainly optimized in a Reflexion-style paradigm, where past interaction trajectories are stored as textual memory and retrieved as prompt context for later reasoning. Although this design allows agents to recall prior feedback and observations, the accumulated experience remains external to model parameters, leaving agents reliant on generic reasoning rather than progressively acquiring recommendation-specific decision-making ability through learning. Reinforcement learning (RL) therefore provides a natural way to internalize such interaction experience into parameters. Yet existing RL methods for ARS still suffer from two key limitations. First, they fail to capture the interactive nature of ARS, in which the recommender agent and the user agent continuously influence each other and can naturally generate endogenous supervision through interaction feedback. Second, they reduce a rich multi-turn interaction process to final outcomes, overlooking the dense supervision embedded throughout the trajectory. To this end, we propose CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems. CoARS introduces two complementary learning schemes: interaction reward, which derives coupled task-level supervision for the recommender agent and the user agent from the same interaction trajectory, and self-distilled credit assignment, which converts historical trajectories into token-level credit signals under teacher-student conditioning. Experiments on multiple datasets show that CoARS outperforms representative ARS baselines in recommendation performance and user alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems (ARS). It reformulates recommendation as a multi-turn interaction between a recommender agent and a user agent, proposing an 'interaction reward' that extracts coupled task-level supervision for both agents from shared trajectories, and a 'self-distilled credit assignment' mechanism that generates token-level credit signals via teacher-student conditioning on historical data. The central claim is that this internalizes interaction experience into model parameters, overcoming the limitations of Reflexion-style memory retrieval and sparse final-outcome RL; experiments on multiple datasets show superior recommendation performance and user alignment over representative ARS baselines.
Significance. If the empirical results hold under rigorous validation, this work could meaningfully advance RL applications in agentic recommender systems by demonstrating how endogenous supervision from co-evolving agents can densify learning signals beyond conventional approaches. The self-distillation idea for credit assignment is a creative extension that addresses multi-turn trajectory richness. Strengths include the clear motivation from existing ARS limitations and the focus on parameter-internalized learning rather than external memory.
Major comments (2)
- [Methods (interaction reward)] The interaction reward mechanism (described in the methods section on coupled supervision) assumes that trajectories between simultaneously updating recommender and user agents yield reliable, stable endogenous signals. However, this setup inherently creates non-stationarity for each agent's policy gradient, a well-known multi-agent RL pathology that can cause oscillation or collapse; no anchoring (e.g., fixed teacher policies), explicit regularization terms, or variance monitoring is specified to mitigate this, which directly undermines the claim of stable RL training from interaction feedback.
- [Methods (self-distilled credit assignment)] In the self-distilled credit assignment description, the teacher-student conditioning for converting trajectories to token-level signals lacks detail on how it handles credit propagation errors over long multi-turn interactions or prevents the student from inheriting unstable teacher signals during co-evolution. This is load-bearing for the 'dense supervision' advantage over outcome-only RL.
Minor comments (2)
- [Abstract] The abstract refers to 'multiple datasets' and 'user alignment' metrics without naming the datasets or defining the alignment measure; this should be expanded in the introduction or experiments section for immediate clarity.
- [Experiments] Ensure the experiments section includes statistical significance tests, number of random seeds, and learning curve variance for all reported outperformance claims to strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which highlight important considerations for the stability and robustness of our proposed framework. We have carefully addressed each major comment below and revised the manuscript accordingly to provide additional details, analysis, and empirical support.
Point-by-point responses
-
Referee: [Methods (interaction reward)] The interaction reward mechanism (described in the methods section on coupled supervision) assumes that trajectories between simultaneously updating recommender and user agents yield reliable, stable endogenous signals. However, this setup inherently creates non-stationarity for each agent's policy gradient, a well-known multi-agent RL pathology that can cause oscillation or collapse; no anchoring (e.g., fixed teacher policies), explicit regularization terms, or variance monitoring is specified to mitigate this, which directly undermines the claim of stable RL training from interaction feedback.
Authors: We acknowledge the validity of this concern regarding non-stationarity in co-evolving multi-agent RL. In the original design, the interaction reward is computed from shared trajectories with alternating updates between the recommender and user agents to provide coupled supervision. To directly address the lack of explicit mitigation, we have revised the Methods section (now including a new subsection on training dynamics) to incorporate a policy regularization term that penalizes large gradient updates and to report variance monitoring of policy gradients across training. Additional experiments in the revised version include plots of reward stability and gradient norms over epochs on all datasets, confirming no oscillation or collapse occurs. These changes strengthen the stability claim while preserving the core co-evolution approach. revision: yes
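The mitigation the authors describe (a regularization term penalizing large policy updates, plus variance monitoring of gradients) is not specified in detail; a minimal sketch under assumed forms might look like the following, where the KL coefficient and the sample-based KL estimate are assumptions, not the paper's choices.

```python
import math

def regularized_policy_loss(logp_new, logp_old, advantages, kl_coef=0.05):
    """Policy-gradient surrogate plus a KL penalty toward the pre-update policy.

    The KL term (hypothetical coefficient) damps large policy jumps, one
    standard way to tame the non-stationarity that arises when the recommender
    and user agents update in the same loop.
    """
    n = len(advantages)
    surrogate = -sum(math.exp(ln - lo) * a
                     for ln, lo, a in zip(logp_new, logp_old, advantages)) / n
    kl = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n  # sample estimate
    return surrogate + kl_coef * kl

class RunningStats:
    """Welford's online mean/variance, e.g. for monitoring gradient norms."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Tracking `RunningStats` over per-step gradient norms is one cheap way to produce the "variance monitoring across training" the rebuttal promises: a variance that grows without bound is an early warning of the oscillation the referee worries about.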
-
Referee: [Methods (self-distilled credit assignment)] In the self-distilled credit assignment description, the teacher-student conditioning for converting trajectories to token-level signals lacks detail on how it handles credit propagation errors over long multi-turn interactions or prevents the student from inheriting unstable teacher signals during co-evolution. This is load-bearing for the 'dense supervision' advantage over outcome-only RL.
Authors: We agree that further elaboration is warranted on these aspects of the self-distilled credit assignment. The teacher is conditioned on fixed historical trajectories (collected prior to the current co-evolution cycle) and held constant during student training to avoid inheriting instability from ongoing updates. Credit propagation for long trajectories employs a discounted advantage estimator with exponential decay to bound error accumulation from early turns. In the revision, we have expanded Section 3.3 with explicit mathematical details on the conditioning and decay mechanism, added pseudocode for the full teacher-student process, and included an ablation on varying trajectory lengths demonstrating maintained performance gains. This provides the requested rigor supporting the dense supervision benefit. revision: yes
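The rebuttal's two claims — a frozen teacher conditioned on fixed historical trajectories, and a discounted advantage estimator with exponential decay to bound error accumulation — can be sketched as below. The exact estimator, the gamma value, and the teacher-minus-student weighting are assumptions for illustration, not taken from the paper.

```python
def decayed_token_credit(turn_rewards, gamma=0.95):
    """Discounted return per turn: later rewards propagate backward with
    exponential decay (gamma), bounding how much error from noisy late
    signals can accumulate onto early turns."""
    credits = [0.0] * len(turn_rewards)
    running = 0.0
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        credits[t] = running
    return credits

def token_level_signal(credits, teacher_logps, student_logps, turn_of_token):
    """Self-distilled token signal: each token inherits its turn's decayed
    credit, weighted by how much a FROZEN teacher (conditioned on fixed
    historical trajectories, so it does not drift during co-evolution)
    prefers that token over the current student."""
    return [credits[turn_of_token[i]] * (teacher_logps[i] - student_logps[i])
            for i in range(len(turn_of_token))]
```

The key property under these assumptions: because the teacher's log-probabilities are computed once from a fixed snapshot, the student's token-level targets stay stationary within a co-evolution cycle, which is exactly the stability argument the authors make.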
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces CoARS as a self-distilled RL framework using interaction rewards and credit assignment for co-evolving recommender and user agents. The abstract and available text contain no equations, derivations, or self-citations that reduce any claimed result to fitted inputs or prior author work by construction. Claims rest on experimental outperformance rather than closed-form predictions or uniqueness theorems imported from self-citations. The framework is presented as an extension of standard RL without self-definitional loops, ansatz smuggling, or renaming of known results, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
Invented entities (2)
-
interaction reward
no independent evidence
-
self-distilled credit assignment
no independent evidence
Reference graph
Works this paper leans on
-
[1]
W.-C. Kang and J. McAuley, "Self-attentive sequential recommendation," in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 197–206.
2018
-
[2]
X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, "LightGCN: Simplifying and powering graph convolution network for recommendation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 639–648.
2020
-
[3]
Z. Wang, M. Gao, W. Li, J. Yu, L. Guo, and H. Yin, "Efficient bi-level optimization for recommendation denoising," in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2502–2511.
2023
-
[4]
S. Cai, J. Zhang, K. Bao, C. Gao, Q. Wang, F. Feng, and X. He, "Agentic feedback loop modeling improves recommendation and user simulation," in Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 2235–2244.
2025
-
[5]
W. Xu, Y. Shi, Z. Liang, X. Ning, K. Mei, K. Wang, X. Zhu, M. Xu, and Y. Zhang, "iAgent: LLM agent as a shield between user and recommender systems," in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 18056–18084.
2025
-
[6]
A. Zhang, Y. Chen, L. Sheng, X. Wang, and T.-S. Chua, "On generative agents in recommendation," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 1807–1817.
2024
-
[7]
X. Huang, J. Lian, Y. Lei, J. Yao, D. Lian, and X. Xie, "Recommender AI agent: Integrating large language models for interactive recommendations," ACM Transactions on Information Systems, vol. 43, no. 4, pp. 1–33, 2025.
2025
-
[8]
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
2023
-
[9]
D. Tran, Y. Li, H. Clay, N. Golrezaei, S. Beygi, and A. Saberi, "Entropy guided diversification and preference elicitation in agentic recommendation systems," arXiv preprint arXiv:2603.11399, 2026.
2026
-
[10]
W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, "MemoryBank: Enhancing large language models with long-term memory," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19724–19731.
2024
-
[11]
W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, "A-MEM: Agentic memory for LLM agents," arXiv preprint arXiv:2502.12110, 2025.
2025
-
[12]
K. Bao, J. Zhang, Y. Zhang, W. Wang, F. Feng, and X. He, "TALLRec: An effective and efficient tuning framework to align large language model with recommendation," in Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 1007–1014.
2023
-
[13]
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu et al., "DAPO: An open-source LLM reinforcement learning system at scale," arXiv preprint arXiv:2503.14476, 2025.
2025
-
[14]
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
2017
-
[15]
D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi et al., "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning," Nature, vol. 645, no. 8081, pp. 633–638, 2025.
2025
-
[16]
M.-D. Nguyen, H.-D. Kieu, and D. D. Le, "AMem4Rec: Leveraging cross-user similarity for memory evolution in agentic LLM recommenders," arXiv preprint arXiv:2602.08837, 2026.
2026
-
[17]
R. You, H. Cai, C. Zhang, Q. Xu, M. Liu, T. Yu, Y. Li, and W. Li, "Agent-as-a-judge," arXiv preprint arXiv:2601.05111, 2026.
2026
-
[18]
F. Liu, X. Lin, H. Yu, M. Wu, J. Wang, Q. Zhang, Z. Zhao, Y. Xia, Y. Zhang, W. Li et al., "RecoWorld: Building simulated environments for agentic recommender systems," arXiv preprint arXiv:2509.10397, 2025.
2025
-
[19]
Z. Wang, M. Gao, J. Yu, Y. Hou, S. Sadiq, and H. Yin, "RuleAgent: Discovering rules for recommendation denoising with autonomous language agents," arXiv preprint arXiv:2503.23374, 2025.
2025
-
[20]
Z. Wang, Y. Yu, W. Zheng, W. Ma, and M. Zhang, "MacRec: A multi-agent collaboration framework for recommendation," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2760–2764.
2024
-
[21]
Y. Song, L. Chen, F. Tajwar, R. Munos, D. Pathak, J. A. Bagnell, A. Singh, and A. Zanette, "Expanding the capabilities of reinforcement learning via text feedback," arXiv preprint arXiv:2602.02482, 2026.
2026
-
[22]
Z. Hou, Z. Hu, Y. Li, R. Lu, J. Tang, and Y. Dong, "TreeRL: LLM reinforcement learning with on-policy tree search," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 12355–12369.
2025
-
[23]
J. Lee, A. Xie, A. Pacchiano, Y. Chandak, C. Finn, O. Nachum, and E. Brunskill, "Supervised pretraining can learn in-context reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, pp. 43057–43083, 2023.
2023
-
[24]
K. Song, A. Moeini, P. Wang, L. Gong, R. Chandra, S. Zhang, and Y. Qi, "Reward is enough: LLMs are in-context reinforcement learners," arXiv preprint arXiv:2506.06303, 2025.
2025
-
[25]
X. Wen, Z. Liu, S. Zheng, S. Ye, Z. Wu, Y. Wang, Z. Xu, X. Liang, J. Li, Z. Miao et al., "Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs," arXiv preprint arXiv:2506.14245, 2025.
2025
-
[26]
I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal, "Self-distillation enables continual learning," arXiv preprint arXiv:2601.19897, 2026.
2026
-
[27]
J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin et al., "Reinforcement learning via self-distillation," arXiv preprint arXiv:2601.20802, 2026.
2026
-
[29]
C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan, "Self-distilled RLVR," arXiv preprint arXiv:2604.03128, 2026.
2026
-
[30]
I. Cantador, P. Brusilovsky, and T. Kuflik, "Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011)," in Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 387–388.
2011
-
[31]
F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, 2015.
2015
-
[32]
Y. Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, "Bridging language and items for retrieval and recommendation," arXiv preprint arXiv:2403.03952, 2024.
2024
-
[33]
Y. Zhao, J. Wu, X. Wang, W. Tang, D. Wang, and M. de Rijke, "Let me do it for you: Towards LLM empowered recommendation via tool learning," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 1796–1806.
2024
-
[34]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, p. 3, 2022.
2022
-
[35]
S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover, "Self-distilled reasoner: On-policy self-distillation for large language models," arXiv preprint arXiv:2601.18734, 2026.
2026
-
[36]
M. Wu, W. Liu, Y. Wang, and M. Yao, "Negotiating the shared agency between humans & AI in the recommender system," in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25), 2025.
2025
-
[37]
L. Wang, J. Zhang, H. Yang, Z.-Y. Chen, J. Tang, Z. Zhang, X. Chen, Y. Lin, H. Sun, R. Song et al., "User behavior simulation with large language model-based agents," ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–37, 2025.
2025
-
[38]
Y. Wang, Z. Jiang, Z. Chen, F. Yang, Y. Zhou, E. Cho, X. Fan, Y. Lu, X. Huang, and Y. Yang, "RecMind: Large language model powered agent for recommendation," in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 4351–4364.
2024
-
[39]
Z. Wang, M. Gao, J. Yu, X. Gao, Q. V. H. Nguyen, S. Sadiq, and H. Yin, "ID-free not risk-free: LLM-powered agents unveil risks in ID-free recommender systems," in Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 1902–1911.
2025
-
[40]
W. Chen, Y. Zhao, J. Huang, Z. Ye, C. M. Ju, T. Zhao, N. Shah, L. Chen, and Y. Zhang, "MemRec: Collaborative memory-augmented agentic recommender system," arXiv preprint arXiv:2601.08816, 2026.
2026
-
[41]
J. Zhang, Y. Hou, R. Xie, W. Sun, J. McAuley, W. X. Zhao, L. Lin, and J.-R. Wen, "AgentCF: Collaborative learning with autonomous language agents for recommender systems," in Proceedings of the ACM Web Conference 2024, 2024, pp. 3679–3689.
2024
-
[42]
J. Liu, S. Gu, D. Li, G. Zhang, M. Han, H. Gu, P. Zhang, T. Lu, L. Shang, and N. Gu, "AgentCF++: Memory-enhanced LLM-based agents for popularity-aware cross-domain recommendations," in Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 2566–2571.
2025
-
[43]
Y. Xia, S. Kim, T. Yu, R. A. Rossi, and J. McAuley, "Multi-agent collaborative filtering: Orchestrating users and items for agentic recommendations," arXiv preprint arXiv:2511.18413, 2025.
2025
-
[44]
B. Li, X. Wang, J. Li, W. Li, L. Zhang, S. Chen, W. X. Zhao, and J.-R. Wen, "RecNet: Self-evolving preference propagation for agentic recommender systems," arXiv preprint arXiv:2601.21609, 2026.
2026
-
[45]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.
2024