MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
Social IQa: Commonsense reasoning about social interactions
Mixed citation behavior. Most common role is unclear (50%).
representative citing papers
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.
Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
ALSO frames social agent interactions as an adversarial bandit problem with a neural reward predictor to enable online strategy optimization in non-stationary multi-agent simulations.
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.
PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.
LLM moral robustness under persona role-play is largely determined by model family with Claude models most consistent, while susceptibility shows little family dependence.
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.
Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.
Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
Hybrid entropy-uncertainty-geometric defence improves clean accuracy by up to 43% and adversarial robustness by up to 65% on NLU and security benchmarks.
citing papers explorer
- ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
- TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.
- Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
- DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
- Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models
Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
- Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning
Fisher information from the target data distribution supplies a task-dependent criterion for selecting LoRA directions that outperforms weight-magnitude heuristics.