Title resolution pending

· 2025

50 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ABRA: Agent Benchmark for Radiology Applications

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.

Containment Verification: AI Safety Guarantees Independent of Alignment

cs.AI · 2026-05-09 · unverdicted · novelty 8.0

The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.

From Summer to Spring: A Shift in US Housing Market Seasonality

econ.GN · 2026-05-20 · unverdicted · novelty 7.0

Post-2021 US housing seasonality shifted from summer to spring because residential mobility moved earlier, as documented in SIPP data and reproduced by a calibrated monthly search-and-matching model.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

cs.CL · 2026-05-18 · conditional · novelty 7.0

PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit

cs.IR · 2026-05-14 · unverdicted · novelty 7.0

AI Overviews boost Reddit engagement in safe communities by 12% but conversational AI Mode reverses gains for experience-based content.

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · novelty 7.0

Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

ASIA: an Autonomous System Identification Agent

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

A lightweight neural dual predictor accelerates exact LAP solvers by over 2x on synthetic data and 1.25-1.5x on real MOT and LPT tasks while preserving full optimality and scaling to N=16384.

Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.

A General Framework for Optimal Group Sequential Testing via Mixed-Integer Linear Programming

stat.ME · 2026-05-05 · unverdicted · novelty 7.0

The authors propose an S-MILP framework that optimizes group sequential testing boundaries to achieve faster rejection of the null hypothesis compared to traditional methods while controlling type I and type II errors.

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

VT-Bench aggregates 14 datasets from 9 domains and evaluates 23 models to standardize visual-tabular discriminative and generative tasks.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.

Latent Preference Modeling for Cross-Session Personalized Tool Calling

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Introduces the MPT benchmark and the PRefine method, which models user preferences as evolving hypotheses to improve personalized tool-calling accuracy at 1.24% of the full-history token cost.

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

cs.LG · 2026-04-18 · unverdicted · novelty 7.0

HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.

Assessing Predictive Models for Fairness Based on Movement Patterns

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Introduces a multi-resolution spatial partitioning and scan statistic method to detect unfairness in predictive models based on movement patterns, validated as effective on synthetic datasets.

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

cs.SE · 2026-05-19 · unverdicted · novelty 6.0

Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.

citing papers explorer

Containment Verification: AI Safety Guarantees Independent of Alignment

cs.AI · 2026-05-09 · unverdicted · partial · ref 18

The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · none · ref 31 · 2 links

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

ASIA: an Autonomous System Identification Agent

cs.AI · 2026-05-11 · unverdicted · none · ref 12

ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

cs.AI · 2026-05-15 · unverdicted · none · ref 19

ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.

Mitigating Cognitive Bias in RLHF by Altering Rationality

cs.AI · 2026-05-07 · unverdicted · none · ref 39

Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · none · ref 98

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · none · ref 59

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

ADR: An Agentic Detection System for Enterprise Agentic AI Security

cs.AI · 2026-05-17 · unverdicted · none · ref 13

ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
