Title resolution pending

2026

13 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

ABRA: Agent Benchmark for Radiology Applications

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.

Containment Verification: AI Safety Guarantees Independent of Alignment

cs.AI · 2026-05-09 · unverdicted · novelty 8.0

Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

MMM-Bench is the first benchmark with a 5-level taxonomy, 5,990 multi-modal documents from 12 commercial domains, expert hierarchical annotations, and baselines that reveal four key challenges in multi-domain document classification.

Evolutionary Ensemble of Agents

cs.NE · 2026-05-09 · unverdicted · novelty 7.0

EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while producing false positives on real damage.

GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

physics.ao-ph · 2026-05-08 · unverdicted · novelty 7.0

GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Frontier AI models can detect evaluation settings and alter their behavior, so standard test scores do not reliably support safety conclusions.

Muon Does Not Converge on Convex Lipschitz Functions

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

An Executable Benchmarking Suite for Tool-Using Agents

cs.SE · 2026-05-10 · unverdicted · novelty 5.0

The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.

From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability

cs.SE · 2026-05-10 · conditional · novelty 5.0

Software engineering is transitioning from code-centric authorship to intent-centric supervision of human-agent systems, where specification, verification, security, and governance become central.

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

cs.LG · 2026-05-08 · unverdicted · novelty 4.0

SparseRL-Sync achieves lossless weight synchronization in large-scale RL by sending only changed parameters, reducing communication volume by roughly 100x under observed 99%+ element-level sparsity.

citing papers explorer

Showing 13 of 13 citing papers.

ABRA: Agent Benchmark for Radiology Applications
cs.CV · 2026-05-11 · unverdicted · none · ref 41

Containment Verification: AI Safety Guarantees Independent of Alignment
cs.AI · 2026-05-09 · unverdicted · partial · ref 20

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
cs.CL · 2026-05-12 · unverdicted · none · ref 54

CrackMeBench: Binary Reverse Engineering for Agents
cs.SE · 2026-05-11 · accept · none · ref 24

Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
cs.CL · 2026-05-11 · unverdicted · none · ref 30

Evolutionary Ensemble of Agents
cs.NE · 2026-05-09 · unverdicted · none · ref 8

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
cs.CV · 2026-05-09 · unverdicted · none · ref 29

GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products
physics.ao-ph · 2026-05-08 · unverdicted · none · ref 165

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested
cs.AI · 2026-05-12 · unverdicted · none · ref 3

Muon Does Not Converge on Convex Lipschitz Functions
cs.LG · 2026-05-09 · unverdicted · none · ref 47

An Executable Benchmarking Suite for Tool-Using Agents
cs.SE · 2026-05-10 · unverdicted · none · ref 7

From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability
cs.SE · 2026-05-10 · conditional · none · ref 33

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
cs.LG · 2026-05-08 · unverdicted · none · ref 26
