EDA decouples erase and write addresses in delta-rule linear attention by adding a targeted erase step along a learned direction before the corrective write, yielding best results on 2.5B dense and 25B MoE models in pretraining and long-context tasks.
hub
Challenging
21 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
SpecRef hybrid AR-diffusion decoding is tested on six benchmarks with three protocols, showing code benchmarks conflate structural and logical correctness, refinement can degrade correct tokens, and log-likelihood versus generative scoring produce inconsistent model rankings.
Elo rankings from pairwise judgments correlate above 0.9 Spearman with accuracy rankings on five converted benchmarks, with minor style and bias effects.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.
Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.
LSP adds hierarchical hyperpriors over global sparsity and weight concentration parameters so that spike-and-slab models can discount inaccurate LLM weights while retaining gains when the weights are good.
SCALE-LoRA proposes a post-retrieval audit framework using sparse residual composition and disagreement-based reliability signals to improve open-pool LoRA adapter reuse on tasks like BIG-Bench Hard.
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.
Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity in SwiGLU FFN with a single-layer repair for long-context failure on RULER-CWE.
LLMs outperform humans in expressing illocutionary intents and sycophancy in successful persuasive counter-arguments from ChangeMyView, with crowd workers preferring LLM versions.
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
Influcoder distills decoders' gradient influence rankings into an encoder for scalable influence-based data attribution.
MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.
Kernel ridge regression combined with mRMR feature selection improves prediction of full benchmark scores from question subsets over existing efficient benchmarking techniques.
citing papers explorer
-
Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
EDA decouples erase and write addresses in delta-rule linear attention by adding a targeted erase step along a learned direction before the corrective write, yielding best results on 2.5B dense and 25B MoE models in pretraining and long-context tasks.
-
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
-
ToxiREX: A Dataset on Toxic REasoning in ConteXt
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
-
Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks
SpecRef hybrid AR-diffusion decoding is tested on six benchmarks with three protocols, showing code benchmarks conflate structural and logical correctness, refinement can degrade correct tokens, and log-likelihood versus generative scoring produce inconsistent model rankings.
-
Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings
Elo rankings from pairwise judgments correlate above 0.9 Spearman with accuracy rankings on five converted benchmarks, with minor style and bias effects.
-
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
-
Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.
-
Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference
Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.
-
LLM Sparsity Prior for Robust Feature Selection
LSP adds hierarchical hyperpriors over global sparsity and weight concentration parameters so that spike-and-slab models can discount inaccurate LLM weights while retaining gains when the weights are good.
-
SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability
SCALE-LoRA proposes a post-retrieval audit framework using sparse residual composition and disagreement-based reliability signals to improve open-pool LoRA adapter reuse on tasks like BIG-Bench Hard.
-
LightThinker++: From Reasoning Compression to Memory Management
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
-
Einstein World Models
Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.
-
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity in SwiGLU FFN with a single-layer repair for long-context failure on RULER-CWE.
-
"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory
LLMs outperform humans in expressing illocutionary intents and sycophancy in successful persuasive counter-arguments from ChangeMyView, with crowd workers preferring LLM versions.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
-
Language models fail at extended rule following
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
-
InternLM2 Technical Report
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
-
Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution
Influcoder distills decoders' gradient influence rankings into an encoder for scalable influence-based data attribution.
-
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.
-
Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Kernel ridge regression combined with mRMR feature selection improves prediction of full benchmark scores from question subsets over existing efficient benchmarking techniques.