CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
arXiv preprint arXiv:2502.00640 (2025)
9 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (9)
Roles: background (1)
Polarities: background (1)
Representative citing papers
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
CogWM is a new LLM user model for evaluating social influence by predicting and tracking cognitive state evolution in dialogues, trained on 150k samples and shown to differentiate AI agents effectively.
Recon scores reasoning traces via action reconstruction fidelity, achieving a 54.7% win rate over post-hoc baselines and up to 70% when used to train synthesis models across four domains.
The MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin.
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
CoFi-PGMA derives a unified counterfactual policy gradient objective based on marginal contribution to correct filtered feedback for both routing and collaborative multi-agent LLM training.
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
citing papers explorer
- MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility