Qwen2 Technical Report
156 Pith papers cite this work.
abstract
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, with supplementary materials, including example code, on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.
citing papers explorer
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
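The summary above describes tempering the reference model and tilting it with an ensemble of reward models at inference time. A minimal sketch of that idea, assuming the standard tilted form pi(y) ∝ pi_ref(y)^(1/tau) · exp(mean_reward(y)/beta) over a fixed candidate set (the function names, the ensemble averaging, and the exact formula are assumptions, not the paper's stated construction):

```python
import math

def tempered_tilted(ref_probs, reward_ensemble, tau=0.8, beta=1.0):
    """Reweight a reference distribution with a tempered reward tilt.

    ref_probs: dict candidate -> reference probability.
    reward_ensemble: list of dicts, each candidate -> reward score.
    Scores each candidate by pi_ref^(1/tau) * exp(mean_reward / beta),
    then renormalizes over the candidate set.
    """
    logits = {}
    for y, p in ref_probs.items():
        mean_r = sum(r[y] for r in reward_ensemble) / len(reward_ensemble)
        logits[y] = math.log(p) / tau + mean_r / beta
    m = max(logits.values())  # subtract max for numerical stability
    w = {y: math.exp(l - m) for y, l in logits.items()}
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}
```

Lowering tau sharpens the distribution toward the reference mode, which is one way a calibration step could trade alignment strength against reward-hacking robustness.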
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
-
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
-
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-tuning tasks.
-
Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning
FLTorrent achieves within-round source unlinkability in decentralized federated learning via a BitTorrent warm-up with pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling, reaching 92% of bandwidth-optimal efficiency and near-random attribution success.
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outperforms standard prompting.
-
Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and stronger out-of-domain robustness than position-based methods across 14 benchmarks.
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
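The cumulative-ratio idea in the summary above can be sketched in a few lines: weight the gradient at token t by the prefix product of per-token importance ratios rather than the single full-sequence product (the function names are mine, and CTPO's exact estimator is not given here):

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(a_t|s_t) / pi_old(a_t|s_t)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def cumulative_ratios(ratios):
    """Prefix-product ratio R_t = prod_{s<=t} r_s for each position t."""
    out, run = [], 1.0
    for r in ratios:
        run *= r
        out.append(run)
    return out

def full_sequence_ratio(ratios):
    """Single sequence-level ratio applied uniformly to every token."""
    return math.prod(ratios)
```

The prefix ratio corrects only for the policy shift up to position t, which is why it can have lower variance than applying the full-sequence product at every token.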
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
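The actual contribution above is a fused GPU kernel on Apple Silicon; as a rough illustration of the storage side only, here is a pure-Python sketch of per-block symmetric int4 quantization with two nibbles packed per byte (the function names and the signed-nibble scheme are assumptions, not the paper's kernel):

```python
def pack_int4(values, scale):
    """Quantize floats to signed int4 (-8..7) with one scale per block,
    packing two 4-bit values into each byte."""
    q = [max(-8, min(7, round(v / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(q), 2):
        lo = q[i] & 0xF
        hi = (q[i + 1] & 0xF) if i + 1 < len(q) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, scale, n):
    """Dequantize: split each byte into two signed nibbles, rescale."""
    vals = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            if nib >= 8:          # restore the sign of the 4-bit value
                nib -= 16
            vals.append(nib * scale)
    return vals[:n]
```

Against fp16 (2 bytes per element), the packed form plus a per-block scale gives roughly the 3x compression the summary reports; the latency win comes from fusing this dequantization into the attention kernel.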
-
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
-
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
-
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
-
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.
-
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
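The log-linear estimation described above can be sketched as an ordinary least-squares fit of accuracy against log10(parameter count) over calibration models, then inverted for a black-box model (the function names and the plain OLS formulation are my assumptions):

```python
import math

def fit_log_linear(params, accs):
    """Least-squares fit acc ~ a * log10(N) + b over calibration models."""
    xs = [math.log10(n) for n in params]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(accs) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, accs))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def estimate_params(acc, a, b):
    """Invert the fit to estimate a black-box model's parameter count."""
    return 10 ** ((acc - b) / a)
```

Fitting on open models with known sizes and probing a closed model's factual accuracy then yields a parameter-count estimate.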
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
-
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
ProjRes achieves near-100% accuracy in membership inference on FedLLMs by measuring projection residuals of hidden embeddings on gradient subspaces, outperforming prior methods by up to 75.75% even under differential privacy.
-
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
-
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
The Indic-CodecFake dataset covers neural audio codec speech deepfakes in Indic languages, and SATYAM, a novel hyperbolic ALM, outperforms baselines through dual-stage semantic-prosodic fusion using the Bhattacharyya distance.
-
Super Apriel: One Checkpoint, Many Speeds
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
-
SimDiff: Depth Pruning via Similarity and Difference
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
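The one-line summary above does not give SimDiff's exact metric; a minimal sketch of the similarity-plus-difference idea, scoring each layer by the cosine similarity between its input and output hidden states combined with their normalized difference (the weighting and names are assumptions):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den

def layer_scores(inputs, outputs, alpha=0.5):
    """Per-layer redundancy score mixing similarity and difference.

    High cosine similarity AND a small normalized difference both
    suggest the layer barely transforms its input, marking it prunable.
    """
    scores = []
    for x, y in zip(inputs, outputs):
        sim = cosine(x, y)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        norm = math.sqrt(sum(a * a for a in x))
        scores.append(alpha * sim + (1 - alpha) * (1 - diff / max(norm, 1e-12)))
    return scores

def prune_candidates(scores, k):
    """Indices of the k most redundant layers (highest scores)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

The difference term is what cosine similarity alone misses: two states can point the same direction yet differ greatly in magnitude.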
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
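The two-sample protocol above can be sketched as: draw one chain-of-thought answer and one program-of-thought answer, accept immediately on agreement, and fall back to a larger majority vote only on disagreement (the function signature and fallback size are assumptions, not the paper's exact procedure):

```python
from collections import Counter

def two_sample_ensemble(cot_answer, pot_answer, fallback_sampler, n_extra=8):
    """Accept when one CoT and one PoT sample agree; otherwise take a
    majority vote that includes extra fallback samples."""
    if cot_answer == pot_answer:
        return cot_answer          # agreement: no extra compute spent
    votes = [cot_answer, pot_answer]
    votes += [fallback_sampler() for _ in range(n_extra)]
    return Counter(votes).most_common(1)[0][0]
```

If the two cheap samples agree on 78.6% of tasks, the expensive fallback path runs only for the remaining minority, which is where the reported 9.3x compute reduction would come from.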
-
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
-
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
-
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
-
On the Invariants of Softmax Attention
Softmax attention has algebraic invariants including zero-sum rows and head-dimension rank limits, plus consistent variance spread in language models attributed to key incoherence.
-
Large Language Models Align with the Human Brain during Creative Thinking
LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
-
When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
Attention sinks in LVLMs create a global-versus-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.