ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
Title resolution pending
50 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.
Post-2021 US housing seasonality shifted from summer to spring because residential mobility moved earlier, as documented in SIPP data and reproduced by a calibrated monthly search-and-matching model.
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
AI Overviews boost Reddit engagement in safe communities by 12% but conversational AI Mode reverses gains for experience-based content.
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
A lightweight neural dual predictor accelerates exact LAP solvers by over 2x on synthetic data and 1.25-1.5x on real MOT and LPT tasks while preserving full optimality and scaling to N=16384.
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
The authors propose an S-MILP framework that optimizes group sequential testing boundaries to achieve faster rejection of the null hypothesis compared to traditional methods while controlling type I and type II errors.
VT-Bench aggregates 14 datasets from 9 domains and evaluates 23 models to standardize visual-tabular discriminative and generative tasks.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.
Introduces a multi-resolution spatial partitioning and scan statistic method to detect unfairness in predictive models based on movement patterns, validated as effective on synthetic datasets.
Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
citing papers explorer
-
Containment Verification: AI Safety Guarantees Independent of Alignment
The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
-
ASIA: an Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
-
ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing
ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.
-
Mitigating Cognitive Bias in RLHF by Altering Rationality
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
ADR: An Agentic Detection System for Enterprise Agentic AI Security
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.