MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (50%).
citation-role summary
citation-polarity summary
representative citing papers
Pre-training 6B LLMs on temporally ordered Common Crawl snapshots yields models with improved factual freshness and temporal precision over shuffled baselines while matching on general language understanding.
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
DiHAL uses geometry proxies to pick where to replace the lower layers of a pretrained transformer with a diffusion bridge for hidden-state reconstruction, improving over token-level diffusion baselines on 8B models.
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
MMM-Bench supplies 5,990 multi-modal documents from 12 commercial domains annotated along a 5-level taxonomy to test document classification under realistic business conditions.
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
citing papers explorer
-
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.
-
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.