ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
Title resolution pending
50 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
Post-2021 US housing seasonality shifted from summer to spring because residential mobility moved earlier, as documented in SIPP data and reproduced by a calibrated monthly search-and-matching model.
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
A lightweight neural dual predictor accelerates exact LAP solvers by over 2x on synthetic data and 1.25-1.5x on real MOT and LPT tasks while preserving full optimality and scaling to N=16384.
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
The authors propose an S-MILP framework that optimizes group sequential testing boundaries to achieve faster rejection of the null hypothesis compared to traditional methods while controlling type I and type II errors.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.
Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
A perturbation framework with Drop/Add/Flip and player-removal operations demonstrates that Bradley-Terry leaderboards are non-robust to sub-1% targeted changes that alter top ranks, Kendall tau, and confidence intervals.
ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
citing papers explorer
-
ABRA: Agent Benchmark for Radiology Applications
ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
-
Containment Verification: AI Safety Guarantees Independent of Alignment
Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
-
From Summer to Spring: A Shift in US Housing Market Seasonality
Post-2021 US housing seasonality shifted from summer to spring because residential mobility moved earlier, as documented in SIPP data and reproduced by a calibrated monthly search-and-matching model.
-
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
-
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
-
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
-
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
-
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
ASIA: an Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
-
Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts
A lightweight neural dual predictor accelerates exact LAP solvers by over 2x on synthetic data and 1.25-1.5x on real MOT and LPT tasks while preserving full optimality and scaling to N=16384.
-
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
-
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
-
A General Framework for Optimal Group Sequential Testing via Mixed-Integer Linear Programming
The authors propose an S-MILP framework that optimizes group sequential testing boundaries to achieve faster rejection of the null hypothesis compared to traditional methods while controlling type I and type II errors.
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
Latent Preference Modeling for Cross-Session Personalized Tool Calling
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
-
HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.
-
Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.
-
DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
-
A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
A perturbation framework with Drop/Add/Flip and player-removal operations demonstrates that Bradley-Terry leaderboards are non-robust to sub-1% targeted changes that alter top ranks, Kendall tau, and confidence intervals.
-
ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing
ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.
-
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
-
Mitigating Cognitive Bias in RLHF by Altering Rationality
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Euclid preparation. CosmoPostProcess: A simulation calibrated framework for weak lensing selection bias in richness-selected galaxy clusters
CosmoPostProcess delivers simulation-calibrated radial corrections for projection-induced selection bias (20-40% amplitude near 1 h^{-1} Mpc) and baryonic effects in Euclid richness-selected cluster weak lensing profiles.
-
Monte Carlo PDE Solvers for Nonlinear Radiative Boundary Conditions
A relaxed Picard iteration plus heteroscedastic boundary denoising lets Monte Carlo PDE solvers solve heat equations with nonlinear radiation boundary conditions more accurately than linearization.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.
-
Agentic Education: Using Claude Code to Teach Claude Code
cc-self-train is an adaptive project-based curriculum for mastering Claude Code featuring persona progression from Guide to Launcher, hook-based engagement adaptation, cross-domain unified feature sequencing, explicit pause primitives, and auto-updating on tool changes, with pilot evidence of self-e
-
Evaluating AI Providers' Frontier Safety Frameworks
Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.
-
Free surfaces in turbulence -- A unified framework from water surfaces to elastic solids
Linear theory predicts regimes for deformable surfaces in turbulence where the interface is enslaved by flow or shows intrinsic dynamics; simulations of air-water and rubber match predictions without wave turbulence.
-
ADR: An Agentic Detection System for Enterprise Agentic AI Security
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
-
Metaphor Is Not All Attention Needs
Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
-
An Executable Benchmarking Suite for Tool-Using Agents
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
-
From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability
Software engineering is transitioning from code-centric authorship to intent-centric supervision of human-agent systems, where specification, verification, security, and governance become central.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
-
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
-
Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction
A modular RAG pipeline with schema-constrained prompting, deterministic post-processing, and second-pass auditing reaches 80.36% F1 on observation extraction from nurse-patient transcripts using GPT-5.2.
-
Automatic Code and Test Generation of Smart Contracts from Coordination Models
A formal coordination model with dynamic roles and data-driven transitions enables a toolchain for validating models, generating Solidity smart contracts, and synthesizing tests.
-
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.
-
Internal Deployment in the AI Act
Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.
- Assessing Predictive Models for Fairness Based on Movement Patterns
- Against the Monolithic Wireless World Model: Why NextG Needs Composable and Agentic Intelligence
- Voice "Cloning" is Style Transfer
- The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit
- PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
- VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning