super hub Canonical reference

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bingxuan Wang, Bing Xue, Bochao Wu, Chengda Lu · 2024 · cs.CL · arXiv 2412.19437

Canonical reference. 71% of citing Pith papers cite this work as background.

766 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 766 citing papers more from Aixin Liu arXiv PDF

abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 128 method 18 baseline 12 dataset 5 other 4

citation-polarity summary

background 118 use method 20 baseline 12 unclear 10 use dataset 4 support 3

claims ledger

abstract We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning

authors

Aixin Liu Bei Feng Bingxuan Wang Bing Xue Bochao Wu Chengda Lu

co-cited works

representative citing papers

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

cs.CE · 2026-04-29 · unverdicted · novelty 8.0

MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

cs.AI · 2026-04-01 · unverdicted · novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.

Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

cs.AR · 2025-11-14 · accept · novelty 8.0

The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

cs.CL · 2025-10-20 · unverdicted · novelty 8.0

SimBench unifies 20 datasets into the first large-scale benchmark, finding top LLMs reach only modest human simulation fidelity of 40.8/100 with log-linear scaling by size and an alignment tradeoff on diverse questions.

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

cs.CL · 2025-07-28 · accept · novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

StarOR couples MCTS with GRPO-based test-time RL and unsupervised rewards to adapt optimization modeling policies instance-specifically, reporting SOTA results on five benchmarks with a 4B model.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

cs.NI · 2026-05-29 · unverdicted · novelty 7.0

HetCCL enables efficient collective communication across mixed-vendor GPU clusters via P2P transport and a border-communicator mechanism, delivering 17-19x higher bandwidth than Gloo and up to 16.9% faster end-to-end LLM training.

citing papers explorer

Showing 50 of 766 citing papers.

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion cs.CV · 2026-05-28 · unverdicted · none · ref 15 · internal anchor
VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 11 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 39 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data econ.EM · 2026-05-13 · accept · none · ref 35 · internal anchor
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models cs.AR · 2026-05-11 · conditional · none · ref 43 · internal anchor
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning cs.LG · 2026-05-09 · conditional · none · ref 6 · internal anchor
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping cs.CE · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 2 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
CHASM: Unveiling Covert Advertisements on Chinese Social Media cs.LG · 2026-04-22 · unverdicted · none · ref 21 · internal anchor
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 36 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models cs.CL · 2026-04-13 · conditional · none · ref 34 · internal anchor
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation cs.DC · 2026-04-11 · unverdicted · none · ref 59 · internal anchor
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 20 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data q-fin.CP · 2026-04-03 · conditional · none · ref 16 · internal anchor
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks cs.AI · 2026-04-01 · unverdicted · none · ref 6 · internal anchor
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy cs.AR · 2025-11-14 · accept · none · ref 4 · internal anchor
The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors cs.CL · 2025-10-20 · unverdicted · none · ref 1 · internal anchor
SimBench unifies 20 datasets into the first large-scale benchmark, finding top LLMs reach only modest human simulation fidelity of 40.8/100 with log-linear scaling by size and an alignment tradeoff on diverse questions.
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation cs.CL · 2025-07-28 · accept · none · ref 4 · internal anchor
MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 96 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning cs.CV · 2026-06-26 · unverdicted · none · ref 38 · internal anchor
Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation cs.DC · 2026-06-23 · unverdicted · none · ref 15 · internal anchor
CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.
StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling cs.LG · 2026-06-13 · unverdicted · none · ref 1 · internal anchor
StarOR couples MCTS with GRPO-based test-time RL and unsupervised rewards to adapt optimization modeling policies instance-specifically, reporting SOTA results on five benchmarks with a 4B model.
RWGBench: Evaluating Scholarly Positioning in Related Work Generation cs.DL · 2026-05-30 · unverdicted · none · ref 6 · internal anchor
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters cs.NI · 2026-05-29 · unverdicted · none · ref 15 · internal anchor
HetCCL enables efficient collective communication across mixed-vendor GPU clusters via P2P transport and a border-communicator mechanism, delivering 17-19x higher bandwidth than Gloo and up to 16.9% faster end-to-end LLM training.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 87 · internal anchor
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 81 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
ElasticMem: Latent Memory as a Learnable Resource for LLM Agents cs.CL · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.
EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution cs.SE · 2026-05-28 · unverdicted · none · ref 45 · internal anchor
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
ActTraitBench is a human-grounded benchmark using psychometric-to-behavior mappings and quantile calibration that reveals pervasive knowledge-decision gaps in 14 LLMs, larger in capable models, with CoCA proposed as mitigation.
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Kronecker Embeddings replace learned embedding tables with a deterministic byte-level character-position factorization and single projection, reducing parameters over 90% with reported gains in loss and robustness on language modeling tasks.
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference cs.DC · 2026-05-27 · unverdicted · none · ref 16 · internal anchor
SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations? cs.AI · 2026-05-26 · unverdicted · none · ref 25 · internal anchor
LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games cs.AI · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization cs.LG · 2026-05-25 · unverdicted · none · ref 27 · internal anchor
Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.
CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference cs.CR · 2026-05-22 · unverdicted · none · ref 23 · internal anchor
CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.
Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization cs.LG · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
UPMs apply periodic time-varying random invertible transforms to sharded model components in decentralized setups to render cross-time assemblies incoherent while preserving network function and incurring minimal overhead.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most cs.AI · 2026-05-21 · unverdicted · none · ref 61 · 2 links · internal anchor
More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation cs.DC · 2026-05-20 · unverdicted · none · ref 36 · internal anchor
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law cs.IR · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
SG-LegalCite supplies 100,890 case-principle pairs from 8,523 Singapore Supreme Court judgments to enable retrieval models that rank precedents using both facts and governing legal principles.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities cs.CR · 2026-05-19 · unverdicted · none · ref 57 · internal anchor
SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.
Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA cs.CL · 2026-05-18 · unverdicted · none · ref 7 · internal anchor
Evaluating LLMLingua-2 at 2x compression on LLaDA shows non-uniform transfer to diffusion LLMs, with mathematical reasoning degrading substantially despite high BERTScore while summarization remains more robust.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG · 2026-05-17 · unverdicted · none · ref 27 · internal anchor
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings cs.CV · 2026-05-17 · conditional · none · ref 42 · internal anchor
The paper presents UniPPTBench and UniPPTEval, a unified benchmark and scenario-aware evaluation framework for presentation generation from vague prompts, long documents, multimodal documents, and multi-source inputs.
HalluScore: Large Language Model Hallucination Question Answering Benchmark cs.CL · 2026-05-16 · unverdicted · none · ref 58 · internal anchor
HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning cs.AI · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
Dynamic Chunking for Diffusion Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 26 · internal anchor
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 2 · 2 links · internal anchor
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation cs.CL · 2026-05-13 · accept · none · ref 6 · internal anchor
Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages cs.CL · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.

DeepSeek-V3 Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer