HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
super hub
Qwen2.5-Coder Technical Report
127 Pith papers cite this work. Polarity classification is still indexing.
abstract
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.
hub tools
claims ledger
- abstract In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-rel
authors
co-cited works
representative citing papers
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
HLS-Seek replaces full-synthesis RL with a comparative proxy reward model plus uncertainty-triggered real checks, yielding higher correctness and better QoR than larger models at 8.5x lower training cost.
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but struggle with functional, indirect and weak-support cases.
Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
Mean-pooled cosine similarity grows with sequence length in anisotropic transformer embeddings independent of content, while CKA shows far less length dependence across code, translation, and vision tasks.
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting output length by 75-85%.
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall performance remains far below human-authored specs.
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
citing papers explorer
-
HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning
HLS-Seek replaces full-synthesis RL with a comparative proxy reward model plus uncertainty-triggered real checks, yielding higher correctness and better QoR than larger models at 8.5x lower training cost.
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting output length by 75-85%.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
Step Rejection Fine-Tuning: A Practical Distillation Recipe
Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.
-
Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Mage shows compile-pass rate is anti-correlated with functional correctness in LLM game scene generation; direct NL-to-C# yields 43% runtime but F1~0.12 structure, while IR conditioning recovers structure (F1 up to 1.0) but halves runtime, with granularity levels statistically equivalent.
-
Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
-
Conceptors for Semantic Steering
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.
-
AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning
AutoPPA learns generalizable PPA optimization rules automatically via contrastive abstraction from diverse code pairs and applies them through adaptive search, outperforming manual methods and prior tools SymRTLO and RTLRewriter.
-
TOPCELL: Topology Optimization of Standard Cell via LLMs
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
-
Interactive Critique-Revision Training for Reliable Structured LLM Generation
DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.
-
Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
Graph representation learning plus iterative augmented Lagrangian optimization creates stronger, harder-to-detect model manipulation attacks on federated LLM fine-tuning, cutting global accuracy by up to 26%.
-
Attribution-Guided Continual Learning for Large Language Models
An attribution-based continual learning framework for LLMs modulates per-parameter gradients using task-specific importance scores to reduce forgetting of prior tasks.
-
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.
-
Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation
MedCo builds and text-enriches a medical knowledge graph using LLMs and EHR data, then fuses text and graph signals via joint LoRA-tuned LLaMA and heterogeneous GNN training to improve EHR clinical predictions on MIMIC datasets.
-
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Fine-tuning LLMs by adapting the mdok approach produces competitive results on binary detection, source attribution, and hybrid/adversarial code identification in SemEval-2026 Task 13.