CRAB-Bench and RUSE form a new evaluation framework for LLM agents on constraint-graph tasks with realistic, human-like user behaviors, reporting 61% pass@1 for the best model and further drops of up to 57% under RUSE.
A survey on multi-turn interaction capabilities of large language models
7 Pith papers cite this work.
representative citing papers
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
A framework integrates user language and probabilistic environment estimates into adaptive safety certificates that guarantee long-term safety for stochastic systems via probabilistic invariance.
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimensional (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
Proposes a multi-layer framework and agent architecture that operationalizes adaptation, coherence, continuity, and agency for longitudinal health AI agents.
citing papers explorer
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks