Title resolution pending

2026

9 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. This hub is built from the citation graph; the title resolver will retry the DOI and OpenAlex lookups on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · novelty 7.0

Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

EvoPolicyGym is a new benchmark suite of 16 compact RL environments that evaluates autonomous policy evolution, with GPT-5.5 achieving the top aggregate rank and top-two performance on all tasks.

Representation Distribution Matching for One-Step Visual Generation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

RDM trains one-step generators via MMD on large batches and multi-encoder representations, achieving SOTA SW_r14 of 1.30 on ImageNet and distilling FLUX.2 to one-step with gains on GenEval and PickScore.

Classifier Context Rot: Monitor Performance Degrades with Context Length

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Frontier LLMs miss dangerous actions in long coding agent transcripts 2-30 times more often after hundreds of thousands of benign tokens.

UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

cs.AI · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides adding 2.4pp.

Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare

cs.CL · 2026-04-30 · unverdicted · novelty 5.0

Assertive linguistic features in training data increase LLMs' pro-animal-welfare reasoning while hedged and sensory-description features decrease it.

Repair the Amplifier, Not the Symptom: Stable World-Model Correction for Agent Rollouts

cs.AI · 2026-07-02

citing papers explorer

Showing 9 of 9 citing papers.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · none · ref 92

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · none · ref 48

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · none · ref 80

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

cs.AI · 2026-07-02 · unverdicted · none · ref 27

Representation Distribution Matching for One-Step Visual Generation

cs.CV · 2026-07-02 · unverdicted · none · ref 27

Classifier Context Rot: Monitor Performance Degrades with Context Length

cs.AI · 2026-05-12 · unverdicted · none · ref 19

UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

cs.AI · 2026-05-10 · unverdicted · none · ref 11 · 2 links

Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare

cs.CL · 2026-04-30 · unverdicted · none · ref 3

Repair the Amplifier, Not the Symptom: Stable World-Model Correction for Agent Rollouts

cs.AI · 2026-07-02 · unreviewed · ref 15
