pith. machine review for the scientific record. sign in

arxiv: 2604.10029 · v2 · submitted 2026-04-11 · 💻 cs.IR

Recognition: unknown

Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3

classification 💻 cs.IR
keywords agentic recommender systemsreinforcement learningself-distillationco-evolving agentsmulti-turn interactionsrecommendation performanceuser alignment
0
0 comments X

The pith

CoARS uses self-distilled RL to let recommender and user agents co-evolve by turning their interaction trajectories into internal training signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a reinforcement learning framework called CoARS to optimize large-language-model-based agentic recommender systems. Existing approaches store past interactions only as external textual memory for prompting, which leaves the models dependent on generic reasoning instead of building recommendation-specific skills through parameter updates. CoARS addresses this by deriving coupled rewards from the mutual influence between recommender and user agents and by converting full trajectories into token-level credit assignments via teacher-student distillation. If the method works, agents would internalize dense supervision signals from multi-turn conversations rather than relying solely on final outcomes or external labels. A sympathetic reader cares because this shifts recommendation from one-shot prediction to progressive, interaction-driven learning.

Core claim

CoARS is a self-distilled reinforcement learning framework that introduces an interaction reward deriving coupled task-level supervision for the recommender agent and the user agent from the same trajectory, together with self-distilled credit assignment that converts historical trajectories into token-level credit signals under teacher-student conditioning, thereby enabling the two agents to co-evolve while internalizing experience directly into their parameters.

What carries the argument

Interaction reward for coupled supervision plus self-distilled credit assignment for token-level signals in a co-evolving multi-agent RL loop.

If this is right

  • CoARS achieves higher recommendation performance than representative ARS baselines across multiple datasets.
  • The method improves user alignment by allowing agents to refine preferences through ongoing interaction.
  • Agents acquire recommendation-specific decision-making ability through parameter updates rather than external memory retrieval.
  • Dense supervision from entire multi-turn trajectories is utilized instead of final outcomes alone.
  • The interactive nature of recommender and user agents is directly captured to produce mutual endogenous signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interaction-reward and self-distillation pattern could be applied to other multi-agent LLM systems where agents can supervise each other without human labels.
  • If the endogenous signals prove stable, training costs for conversational recommenders might drop by reducing dependence on separate RLHF stages.
  • Testing the framework on longer interaction horizons or with real-time user feedback loops would reveal whether the co-evolution remains stable at scale.

Load-bearing premise

That interaction trajectories between recommender and user agents naturally generate reliable endogenous supervision signals sufficient for stable RL training without external labels or human feedback.

What would settle it

An ablation study in which removing the interaction reward or the self-distillation component causes performance to fall back to the level of standard Reflexion-style ARS baselines, or in which training diverges without added external rewards.

Figures

Figures reproduced from arXiv: 2604.10029 by Hongzhi Yin, Junliang Yu, Min Gao, Quoc Viet Hung Nguyen, Shazia Sadiq, Tianrui Li, Tong Chen, Zongwei Wang.

Figure 1
Figure 1. Figure 1: Comparison of evolution paradigms in agentic recommender systems. The left part shows the standard ARS pipeline [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The framework of our CoARS. contain much richer supervision than these existing mech￾anisms exploit. Each interaction naturally produces coupled outputs and reasoning trajectories from both RecAgent and UserAgent, thereby revealing whether the two agents are collaborating effectively. This makes it possible to naturally transform a single interaction into richer mutual learning signals for both sides. As s… view at source ↗
Figure 3
Figure 3. Figure 3: Ablation results of CoARS on the recommender side [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The effect of the hyperparameters. provides finer token-level guidance on where stronger updates are needed. Thus, direct KD and token-level credit assignment play similar roles, but the latter is more naturally aligned with the RL objective. Moreover, direct self-distillation requires computing teacher-student divergence over the full vocabulary at every decoding step, whereas our method avoids this extra… view at source ↗
read the original abstract

Large language model-empowered agentic recommender systems (ARS) reformulate recommendation as a multi-turn interaction between a recommender agent and a user agent, enabling iterative preference elicitation and refinement beyond conventional one-shot prediction. However, existing ARS are mainly optimized in a Reflexion-style paradigm, where past interaction trajectories are stored as textual memory and retrieved as prompt context for later reasoning. Although this design allows agents to recall prior feedback and observations, the accumulated experience remains external to model parameters, leaving agents reliant on generic reasoning rather than progressively acquiring recommendation-specific decision-making ability through learning. Reinforcement learning (RL) therefore provides a natural way to internalize such interaction experience into parameters. Yet existing RL methods for ARS still suffer from two key limitations. First, they fail to capture the interactive nature of ARS, in which the recommender agent and the user agent continuously influence each other and can naturally generate endogenous supervision through interaction feedback. Second, they reduce a rich multi-turn interaction process to final outcomes, overlooking the dense supervision embedded throughout the trajectory. To this end, we propose CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems. CoARS introduces two complementary learning schemes: interaction reward, which derives coupled task-level supervision for the recommender agent and the user agent from the same interaction trajectory, and self-distilled credit assignment, which converts historical trajectories into token-level credit signals under teacher-student conditioning. Experiments on multiple datasets show that CoARS outperforms representative ARS baselines in recommendation performance and user alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems (ARS). It reformulates recommendation as multi-turn interactions between a recommender agent and a user agent, proposing an 'interaction reward' that extracts coupled task-level supervision for both agents from shared trajectories and a 'self-distilled credit assignment' mechanism that generates token-level credit signals via teacher-student conditioning on historical data. The central claim is that this internalizes interaction experience into model parameters, overcoming limitations of reflexion-style memory retrieval and sparse final-outcome RL, with experiments on multiple datasets showing superior recommendation performance and user alignment over representative ARS baselines.

Significance. If the empirical results hold under rigorous validation, this work could meaningfully advance RL applications in agentic recommender systems by demonstrating how endogenous supervision from co-evolving agents can densify learning signals beyond conventional approaches. The self-distillation idea for credit assignment is a creative extension that addresses multi-turn trajectory richness. Strengths include the clear motivation from existing ARS limitations and the focus on parameter-internalized learning rather than external memory.

major comments (2)
  1. [Methods (interaction reward)] The interaction reward mechanism (described in the methods section on coupled supervision) assumes that trajectories between simultaneously updating recommender and user agents yield reliable, stable endogenous signals. However, this setup inherently creates non-stationarity for each agent's policy gradient, a well-known multi-agent RL pathology that can cause oscillation or collapse; no anchoring (e.g., fixed teacher policies), explicit regularization terms, or variance monitoring is specified to mitigate this, which directly undermines the claim of stable RL training from interaction feedback.
  2. [Methods (self-distilled credit assignment)] In the self-distilled credit assignment description, the teacher-student conditioning for converting trajectories to token-level signals lacks detail on how it handles credit propagation errors over long multi-turn interactions or prevents the student from inheriting unstable teacher signals during co-evolution. This is load-bearing for the 'dense supervision' advantage over outcome-only RL.
minor comments (2)
  1. [Abstract] The abstract refers to 'multiple datasets' and 'user alignment' metrics without naming the datasets or defining the alignment measure; this should be expanded in the introduction or experiments section for immediate clarity.
  2. [Experiments] Ensure the experiments section includes statistical significance tests, number of random seeds, and learning curve variance for all reported outperformance claims to strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight important considerations for the stability and robustness of our proposed framework. We have carefully addressed each major comment below and revised the manuscript accordingly to provide additional details, analysis, and empirical support.

read point-by-point responses
  1. Referee: [Methods (interaction reward)] The interaction reward mechanism (described in the methods section on coupled supervision) assumes that trajectories between simultaneously updating recommender and user agents yield reliable, stable endogenous signals. However, this setup inherently creates non-stationarity for each agent's policy gradient, a well-known multi-agent RL pathology that can cause oscillation or collapse; no anchoring (e.g., fixed teacher policies), explicit regularization terms, or variance monitoring is specified to mitigate this, which directly undermines the claim of stable RL training from interaction feedback.

    Authors: We acknowledge the validity of this concern regarding non-stationarity in co-evolving multi-agent RL. In the original design, the interaction reward is computed from shared trajectories with alternating updates between the recommender and user agents to provide coupled supervision. To directly address the lack of explicit mitigation, we have revised the Methods section (now including a new subsection on training dynamics) to incorporate a policy regularization term that penalizes large gradient updates and to report variance monitoring of policy gradients across training. Additional experiments in the revised version include plots of reward stability and gradient norms over epochs on all datasets, confirming no oscillation or collapse occurs. These changes strengthen the stability claim while preserving the core co-evolution approach. revision: yes

  2. Referee: [Methods (self-distilled credit assignment)] In the self-distilled credit assignment description, the teacher-student conditioning for converting trajectories to token-level signals lacks detail on how it handles credit propagation errors over long multi-turn interactions or prevents the student from inheriting unstable teacher signals during co-evolution. This is load-bearing for the 'dense supervision' advantage over outcome-only RL.

    Authors: We agree that further elaboration is warranted on these aspects of the self-distilled credit assignment. The teacher is conditioned on fixed historical trajectories (collected prior to the current co-evolution cycle) and held constant during student training to avoid inheriting instability from ongoing updates. Credit propagation for long trajectories employs a discounted advantage estimator with exponential decay to bound error accumulation from early turns. In the revision, we have expanded Section 3.3 with explicit mathematical details on the conditioning and decay mechanism, added pseudocode for the full teacher-student process, and included an ablation on varying trajectory lengths demonstrating maintained performance gains. This provides the requested rigor supporting the dense supervision benefit. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces CoARS as a self-distilled RL framework using interaction rewards and credit assignment for co-evolving recommender and user agents. The abstract and available text contain no equations, derivations, or self-citations that reduce any claimed result to fitted inputs or prior author work by construction. Claims rest on experimental outperformance rather than closed-form predictions or uniqueness theorems imported from self-citations. The framework is presented as an extension of standard RL without self-definitional loops, ansatz smuggling, or renaming of known results, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities detailed beyond the two new learning schemes introduced as novel contributions.

invented entities (2)
  • interaction reward no independent evidence
    purpose: Derives coupled task-level supervision for both agents from the same trajectory
    Newly proposed mechanism to capture interactive nature of ARS
  • self-distilled credit assignment no independent evidence
    purpose: Converts historical trajectories into token-level credit signals under teacher-student conditioning
    Newly proposed to provide dense supervision throughout multi-turn process

pith-pipeline@v0.9.0 · 5603 in / 1099 out tokens · 47165 ms · 2026-05-10T16:19:50.142041+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 12 internal anchors

  1. [1]

    Self-attentive sequential recommenda- tion,

    W.-C. Kang and J. McAuley, “Self-attentive sequential recommenda- tion,” in2018 IEEE international conference on data mining (ICDM). IEEE, 2018, pp. 197–206

  2. [2]

    Lightgcn: Simplifying and powering graph convolution network for recommenda- tion,

    X. He, K. Deng, X. Wang, Y . Li, Y . Zhang, and M. Wang, “Lightgcn: Simplifying and powering graph convolution network for recommenda- tion,” inProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 639– 648

  3. [3]

    Efficient bi- level optimization for recommendation denoising,

    Z. Wang, M. Gao, W. Li, J. Yu, L. Guo, and H. Yin, “Efficient bi- level optimization for recommendation denoising,” inProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2023, pp. 2502–2511

  4. [4]

    Agentic feedback loop modeling improves recommendation and user simulation,

    S. Cai, J. Zhang, K. Bao, C. Gao, Q. Wang, F. Feng, and X. He, “Agentic feedback loop modeling improves recommendation and user simulation,” inProceedings of the 48th International ACM SIGIR conference on Research and Development in Information Retrieval, 2025, pp. 2235– 2244

  5. [5]

    iagent: Llm agent as a shield between user and recommender systems,

    W. Xu, Y . Shi, Z. Liang, X. Ning, K. Mei, K. Wang, X. Zhu, M. Xu, and Y . Zhang, “iagent: Llm agent as a shield between user and recommender systems,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 18 056–18 084

  6. [6]

    On generative agents in recommendation,

    A. Zhang, Y . Chen, L. Sheng, X. Wang, and T.-S. Chua, “On generative agents in recommendation,” inProceedings of the 47th international ACM SIGIR conference on research and development in Information Retrieval, 2024, pp. 1807–1817

  7. [7]

    Recommender ai agent: Integrating large language models for interactive recommen- dations,

    X. Huang, J. Lian, Y . Lei, J. Yao, D. Lian, and X. Xie, “Recommender ai agent: Integrating large language models for interactive recommen- dations,”ACM Transactions on Information Systems, vol. 43, no. 4, pp. 1–33, 2025

  8. [8]

    Re- flexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

  9. [9]

    Entropy guided diversification and preference elicitation in agentic recommendation systems,

    D. Tran, Y . Li, H. Clay, N. Golrezaei, S. Beygi, and A. Saberi, “Entropy guided diversification and preference elicitation in agentic recommendation systems,”arXiv preprint arXiv:2603.11399, 2026

  10. [10]

    Memorybank: En- hancing large language models with long-term memory,

    W. Zhong, L. Guo, Q. Gao, H. Ye, and Y . Wang, “Memorybank: En- hancing large language models with long-term memory,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, no. 17, 2024, pp. 19 724–19 731

  11. [11]

    A-MEM: Agentic Memory for LLM Agents

    W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y . Zhang, “A-mem: Agentic memory for llm agents,”arXiv preprint arXiv:2502.12110, 2025

  12. [12]

    Tallrec: An effective and efficient tuning framework to align large language model with recommendation,

    K. Bao, J. Zhang, Y . Zhang, W. Wang, F. Feng, and X. He, “Tallrec: An effective and efficient tuning framework to align large language model with recommendation,” inProceedings of the 17th ACM conference on recommender systems, 2023, pp. 1007–1014

  13. [13]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Q. Yu, Z. Zhang, R. Zhu, Y . Yuan, X. Zuo, Y . Yue, W. Dai, T. Fan, G. Liu, L. Liuet al., “Dapo: An open-source llm reinforcement learning system at scale,”arXiv preprint arXiv:2503.14476, 2025

  14. [14]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  15. [15]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,”Nature, vol. 645, no. 8081, pp. 633–638, 2025

  16. [16]

    Amem4rec: Leveraging cross-user similarity for memory evolution in agentic llm recom- menders,

    M.-D. Nguyen, H.-D. Kieu, and D. D. Le, “Amem4rec: Leveraging cross-user similarity for memory evolution in agentic llm recom- menders,”arXiv preprint arXiv:2602.08837, 2026

  17. [17]

    A survey on agent-as-a-judge, 2026

    R. You, H. Cai, C. Zhang, Q. Xu, M. Liu, T. Yu, Y . Li, and W. Li, “Agent-as-a-judge,”arXiv preprint arXiv:2601.05111, 2026

  18. [18]

    Recoworld: Building simulated environments for agentic recommender systems,

    F. Liu, X. Lin, H. Yu, M. Wu, J. Wang, Q. Zhang, Z. Zhao, Y . Xia, Y . Zhang, W. Liet al., “Recoworld: Building simulated environments for agentic recommender systems,”arXiv preprint arXiv:2509.10397, 2025

  19. [19]

    Ruleagent: Dis- covering rules for recommendation denoising with autonomous language agents,

    Z. Wang, M. Gao, J. Yu, Y . Hou, S. Sadiq, and H. Yin, “Ruleagent: Dis- covering rules for recommendation denoising with autonomous language agents,”arXiv preprint arXiv:2503.23374, 2025

  20. [20]

    Macrec: A multi- agent collaboration framework for recommendation,

    Z. Wang, Y . Yu, W. Zheng, W. Ma, and M. Zhang, “Macrec: A multi- agent collaboration framework for recommendation,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2760–2764

  21. [21]

    Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

    Y . Song, L. Chen, F. Tajwar, R. Munos, D. Pathak, J. A. Bagnell, A. Singh, and A. Zanette, “Expanding the capabilities of reinforcement learning via text feedback,”arXiv preprint arXiv:2602.02482, 2026

  22. [22]

    Treerl: Llm reinforcement learning with on-policy tree search,

    Z. Hou, Z. Hu, Y . Li, R. Lu, J. Tang, and Y . Dong, “Treerl: Llm reinforcement learning with on-policy tree search,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 12 355–12 369

  23. [23]

    Supervised pretraining can learn in-context reinforcement learning,

    J. Lee, A. Xie, A. Pacchiano, Y . Chandak, C. Finn, O. Nachum, and E. Brunskill, “Supervised pretraining can learn in-context reinforcement learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 43 057–43 083, 2023

  24. [24]

    Reward Is Enough: LLMs Are In-Context Reinforcement Learners

    K. Song, A. Moeini, P. Wang, L. Gong, R. Chandra, S. Zhang, and Y . Qi, “Reward is enough: Llms are in-context reinforcement learners,” arXiv preprint arXiv:2506.06303, 2025

  25. [25]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    X. Wen, Z. Liu, S. Zheng, S. Ye, Z. Wu, Y . Wang, Z. Xu, X. Liang, J. Li, Z. Miaoet al., “Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms,”arXiv preprint arXiv:2506.14245, 2025

  26. [26]

    Self-Distillation Enables Continual Learning

    I. Shenfeld, M. Damani, J. H ¨ubotter, and P. Agrawal, “Self-distillation enables continual learning,”arXiv preprint arXiv:2601.19897, 2026

  27. [27]

    Reinforcement Learning via Self-Distillation

    J. H ¨ubotter, F. L¨ubeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrinet al., “Reinforcement learning via self-distillation,”arXiv preprint arXiv:2601.20802, 2026

  28. [29]

    Self-Distilled RLVR

    C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan, “Self-distilled rlvr,”arXiv preprint arXiv:2604.03128, 2026

  29. [30]

    Second workshop on infor- mation heterogeneity and fusion in recommender systems (hetrec2011),

    I. Cantador, P. Brusilovsky, and T. Kuflik, “Second workshop on infor- mation heterogeneity and fusion in recommender systems (hetrec2011),” inProceedings of the fifth ACM conference on Recommender systems, 2011, pp. 387–388

  30. [31]

    The movielens datasets: History and context,

    F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,”Acm transactions on interactive intelligent systems (tiis), vol. 5, no. 4, pp. 1–19, 2015

  31. [32]

    Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

    Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, “Bridging language and items for retrieval and recommendation,”arXiv preprint arXiv:2403.03952, 2024

  32. [33]

    Let me do it for you: Towards llm empowered recommendation via tool learning,

    Y . Zhao, J. Wu, X. Wang, W. Tang, D. Wang, and M. De Rijke, “Let me do it for you: Towards llm empowered recommendation via tool learning,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 1796–1806

  33. [34]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” Iclr, vol. 1, no. 2, p. 3, 2022

  34. [35]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover, “Self-distilled reasoner: On-policy self-distillation for large language models,”arXiv preprint arXiv:2601.18734, 2026

  35. [36]

    Negotiating the shared agency between humans & ai in the recommender system,

    M. Wu, W. Liu, Y . Wang, and M. Yao, “Negotiating the shared agency between humans & ai in the recommender system,” inExtended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’25), 2025

  36. [37]

    User behavior simulation with large lan- guage model-based agents,

    L. Wang, J. Zhang, H. Yang, Z.-Y . Chen, J. Tang, Z. Zhang, X. Chen, Y . Lin, H. Sun, R. Songet al., “User behavior simulation with large lan- guage model-based agents,”ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–37, 2025

  37. [38]

    Recmind: Large language model powered agent for recommendation,

    Y . Wang, Z. Jiang, Z. Chen, F. Yang, Y . Zhou, E. Cho, X. Fan, Y . Lu, X. Huang, and Y . Yang, “Recmind: Large language model powered agent for recommendation,” inFindings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 4351–4364

  38. [39]

    Id-free not risk-free: Llm-powered agents unveil risks in id- free recommender systems,

    Z. Wang, M. Gao, J. Yu, X. Gao, Q. V . H. Nguyen, S. Sadiq, and H. Yin, “Id-free not risk-free: Llm-powered agents unveil risks in id- free recommender systems,” inProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 1902–1911

  39. [40]

    MemRec: Collaborative Memory-Augmented Agentic Recommender System

    W. Chen, Y . Zhao, J. Huang, Z. Ye, C. M. Ju, T. Zhao, N. Shah, L. Chen, and Y . Zhang, “Memrec: Collaborative memory-augmented agentic recommender system,”arXiv preprint arXiv:2601.08816, 2026

  40. [41]

    Agentcf: Collaborative learning with autonomous language agents for recommender systems,

    J. Zhang, Y . Hou, R. Xie, W. Sun, J. McAuley, W. X. Zhao, L. Lin, and J.-R. Wen, “Agentcf: Collaborative learning with autonomous language agents for recommender systems,” inProceedings of the ACM Web Conference 2024, 2024, pp. 3679–3689

  41. [42]

    Agentcf++: Memory-enhanced llm-based agents for popularity-aware cross-domain recommendations,

    J. Liu, S. Gu, D. Li, G. Zhang, M. Han, H. Gu, P. Zhang, T. Lu, L. Shang, and N. Gu, “Agentcf++: Memory-enhanced llm-based agents for popularity-aware cross-domain recommendations,” inProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 2566–2571. JOURNAL OF LATEX CLASS FILES, VOL. 14, ...

  42. [43]

    Multi-agent collaborative filtering: Orchestrating users and items for agentic rec- ommendations,

    Y . Xia, S. Kim, T. Yu, R. A. Rossi, and J. McAuley, “Multi-agent collaborative filtering: Orchestrating users and items for agentic rec- ommendations,”arXiv preprint arXiv:2511.18413, 2025

  43. [44]

    Recnet: Self-evolving preference propagation for agentic recommender systems.arXiv preprint arXiv:2601.21609,

    B. Li, X. Wang, J. Li, W. Li, L. Zhang, S. Chen, W. X. Zhao, and J.-R. Wen, “Recnet: Self-evolving preference propagation for agentic recommender systems,”arXiv preprint arXiv:2601.21609, 2026

  44. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024. Zongwei Wangis currently pursuing his Ph.D. at Chongqing University. His research has been pub- lished in top data mining conferences and jour...