MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
Qwen3.5 : Towards native multimodal agents, February 2026
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6verdicts
UNVERDICTED 6representative citing papers
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
citing papers explorer
-
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.