super hub

Qwen2.5 Technical Report

arXiv preprint arXiv:2412 · 2024 · cs.CL · arXiv 2412.15115

367 Pith papers cite this work. Polarity classification is still indexing.

367 Pith papers citing it

open full Pith review browse 367 citing papers more from arXiv preprint arXiv:2412 arXiv PDF

abstract

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

hub tools

JSON dossier citing papers JSON arXiv source

claims ledger

abstract In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well

authors

arXiv preprint arXiv:2412

co-cited works

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

cs.AI · 2026-05-11 · conditional · novelty 8.0

FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

cs.CV · 2026-05-07 · unverdicted · novelty 8.0 · 2 refs

Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.

What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

cs.CL · 2026-05-13 · accept · novelty 7.0

Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

cs.LG · 2026-05-13 · conditional · novelty 7.0

AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

CAQ-ZO aligns ZO query stencils to compander grids, eliminating query-time residual error and improving NF4 fine-tuning performance on Qwen and Llama models compared to standard quantized baselines.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

Unsupervised Process Reward Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

cs.SI · 2026-05-11 · unverdicted · novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outperforms standard prompting.

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-based methods.

citing papers explorer

Showing 50 of 367 citing papers.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 124 · internal anchor
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims cs.CR · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models cs.AI · 2026-05-11 · conditional · none · ref 2 · internal anchor
FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 11 · internal anchor
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media cs.CV · 2026-05-07 · unverdicted · none · ref 35 · 2 links · internal anchor
Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 76 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 27 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction cs.LG · 2026-05-13 · unverdicted · none · ref 44 · internal anchor
GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation cs.CL · 2026-05-13 · accept · none · ref 24 · internal anchor
Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation cs.AI · 2026-05-13 · unverdicted · none · ref 48 · internal anchor
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning cs.LG · 2026-05-13 · conditional · none · ref 28 · internal anchor
AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning cs.SE · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Deep Minds and Shallow Probes cs.LG · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods cs.LG · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 30 · 2 links · internal anchor
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
CAQ-ZO aligns ZO query stencils to compander grids, eliminating query-time residual error and improving NF4 fine-tuning performance on Qwen and Llama models compared to standard quantized baselines.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
Unsupervised Process Reward Models cs.LG · 2026-05-11 · unverdicted · none · ref 47 · internal anchor
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation cs.SI · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outperforms standard prompting.
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations cs.AI · 2026-05-11 · unverdicted · none · ref 43 · internal anchor
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models cs.LG · 2026-05-10 · unverdicted · none · ref 29 · internal anchor
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence? cs.AI · 2026-05-10 · unverdicted · none · ref 17 · internal anchor
Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction cs.CL · 2026-05-10 · unverdicted · none · ref 38 · internal anchor
LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-based methods.
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs cs.CL · 2026-05-10 · unverdicted · none · ref 14 · internal anchor
LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
ProactBench: Beyond What The User Asked For cs.LG · 2026-05-09 · unverdicted · none · ref 141 · internal anchor
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures cs.LG · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
Test-time scaling for personalized LLMs follows a logarithmic utility curve under oracle selection but standard reward models suffer user-level collapse and query-level hacking; a probabilistic reward model with learned variance enables consistent scaling.
EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints cs.CL · 2026-05-09 · unverdicted · none · ref 30 · internal anchor
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play cs.AI · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration cs.CR · 2026-05-08 · conditional · none · ref 51 · internal anchor
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR cs.LG · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation cs.CL · 2026-05-08 · unverdicted · none · ref 49 · internal anchor
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
Regulating Branch Parallelism in LLM Serving cs.DC · 2026-05-07 · unverdicted · none · ref 26 · internal anchor
TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while keeping over 95% SLO attainment.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 40 · internal anchor
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Long Context Pre-Training with Lighthouse Attention cs.CL · 2026-05-07 · conditional · none · ref 27 · internal anchor
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG cs.CL · 2026-05-07 · unverdicted · none · ref 79 · internal anchor
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 141 · internal anchor
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Stateful Agent Backdoor cs.CR · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
Logic-Regularized Verifier Elicits Reasoning from LLMs cs.CL · 2026-05-07 · unverdicted · none · ref 67 · internal anchor
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 11 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding cs.AI · 2026-05-07 · conditional · none · ref 25 · internal anchor
Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- and scale-dependent.
The First Token Knows: Single-Decode Confidence for Hallucination Detection cs.CL · 2026-05-06 · unverdicted · none · ref 11 · internal anchor
First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) across three 7-8B models and two benchmarks.
StoryAlign: Evaluating and Training Reward Models for Story Generation cs.CL · 2026-05-06 · unverdicted · none · ref 23 · internal anchor
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers cs.LG · 2026-05-05 · unverdicted · none · ref 32 · internal anchor
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models cs.AI · 2026-05-05 · conditional · none · ref 23 · internal anchor
FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV · 2026-05-05 · unverdicted · none · ref 39 · 2 links · internal anchor
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs cs.PF · 2026-05-04 · unverdicted · none · ref 5 · 2 links · internal anchor
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence cs.CR · 2026-05-03 · unverdicted · none · ref 55 · internal anchor
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation cs.CL · 2026-05-02 · unverdicted · none · ref 73 · internal anchor
MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.