12 Pith papers cite this work. Representative citing papers (2026):
- ABRA: Agent Benchmark for Radiology Applications
ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
- Containment Verification: AI Safety Guarantees Independent of Alignment
Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
- ASIA: An Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
- Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts
A lightweight neural dual predictor accelerates exact LAP solvers by over 2x on synthetic data and 1.25-1.5x on real MOT and LPT tasks while preserving full optimality and scaling to N=16384.
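The warm-start idea can be illustrated with a toy auction solver, where the price vector plays the role of the dual variables: a learned predictor can initialize prices near the optimum to cut iterations, and for integer costs any eps < 1/n still yields an exactly optimal assignment regardless of the starting prices. This is a minimal sketch under those assumptions, not the paper's implementation; `auction_lap` and its signature are illustrative.

```python
import numpy as np

def auction_lap(cost, prices=None, eps=1e-2):
    """Exact min-cost assignment via Bertsekas' auction algorithm.

    `prices` are the dual variables; a neural predictor could supply them
    as a warm start (hypothetical usage). For integer costs and eps < 1/n,
    the final assignment is exactly optimal for any starting prices, which
    is why warm-starting preserves optimality.
    """
    n = cost.shape[0]
    benefit = -np.asarray(cost, dtype=float)
    prices = np.zeros(n) if prices is None else np.array(prices, dtype=float)
    owner = np.full(n, -1, dtype=int)   # owner[j]: agent currently holding object j
    assign = np.full(n, -1, dtype=int)  # assign[i]: object held by agent i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = values.max()
        prices[j] += (best - second) + eps  # bid raises the best object's price
        if owner[j] != -1:                  # evict the previous holder
            assign[owner[j]] = -1
            unassigned.append(owner[j])
        owner[j] = i
        assign[i] = j
    return assign, prices
```

Rerunning the solver with the returned (or predicted) prices mimics the warm start: the duals are already near-optimal, so far fewer bidding rounds are needed.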
- Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
External evolution statistically significantly outperforms internal deliberation in collective-action tasks, but neither approach helps in trading; deliberation never discovers punishment strategies, while evolution does.
- Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch between the separated rollout and optimization stages of LLM RL can, on its own, cause training collapse.
- PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
PipeSD achieves 1.16x-2.16x speedup and 14.3%-25.3% lower energy use in cloud-edge LLM inference via token-batch pipeline scheduling optimized by dynamic programming and a Bayesian-optimized dual-threshold NAV trigger.
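The dynamic-programming flavor of token-batch scheduling can be sketched as a classic partition DP over split points: dp[i] is the cheapest way to process the first i tokens, with the last batch covering tokens (j, i]. The cost model `batch_cost` below is an assumed stand-in (fixed launch overhead plus a size term); the paper's actual objective over pipeline latency and energy is richer.

```python
def optimal_batches(n_tokens, batch_cost):
    """O(n^2) partition DP: choose token-batch boundaries minimizing
    total cost, where batch_cost(j, i) prices the batch of tokens (j, i].
    """
    INF = float("inf")
    dp = [0.0] + [INF] * n_tokens   # dp[i]: min cost for the first i tokens
    prev = [0] * (n_tokens + 1)     # prev[i]: start of the last batch ending at i
    for i in range(1, n_tokens + 1):
        for j in range(i):
            c = dp[j] + batch_cost(j, i)
            if c < dp[i]:
                dp[i], prev[i] = c, j
    # walk back through the chosen split points to recover the batches
    batches, i = [], n_tokens
    while i > 0:
        batches.append((prev[i], i))
        i = prev[i]
    return dp[n_tokens], batches[::-1]
```

With a purely linear cost the DP correctly collapses to one big batch (splitting only adds overhead); a superlinear size penalty makes it balance several smaller batches.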
- Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
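The KV-injection idea can be illustrated with single-head toy attention: concatenate the bank's key/value rows in front of the context cache, so queries attend to the injected "memories" with no weight updates. This NumPy sketch assumes nothing about the paper's actual layer selection or bank construction; it only shows the mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_kv_bank(q, k_ctx, v_ctx, k_bank, v_bank):
    """Single-head attention with an injected latent KV bank.

    Bank rows are prepended to the context keys/values, so the query can
    attend to injected content training-free; a toy stand-in for the
    paper's multi-layer injection scheme.
    """
    keys = np.concatenate([k_bank, k_ctx], axis=0)
    values = np.concatenate([v_bank, v_ctx], axis=0)
    scale = np.sqrt(q.shape[-1])
    weights = softmax(q @ keys.T / scale, axis=-1)
    return weights @ values
```

If a bank key is strongly aligned with the query, attention mass collapses onto the bank's values, which is the steering effect; misaligned bank keys leave the context attention nearly untouched.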
- Metaphor Is Not All Attention Needs
Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
- An Executable Benchmarking Suite for Tool-Using Agents
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
- From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability
Software engineering is transitioning from code-centric authorship to intent-centric supervision of human-agent systems, where specification, verification, security, and governance become central.