An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
hub Mixed citations
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Mixed citation behavior. Most common role is background (60%).
abstract
In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingl
- dataset els with different architectures, including LLaMA-2 [62], Mistral [28], and Mixtral [29], where Mixtral is a Mixture-of-Experts (MoE) model. Model accuracy is measured on multiple datasets, including the Massive Multitask Language Understanding (MMLU) [25] in the five-shot setting and zero-shot Commonsense QA benchmarks such as WinoGrande [58], PIQA [12], HellaSwag [68], ARC [16], BoolQ [15] and OBQA [45]. All evaluations are conducted using the Language Model Evaluation Harness [21]. 4.2 In-Net
- dataset MultiRC dataset encompasses around 6, 000 multi- sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five. B. Datasets for Emergent: ICL, reasoning (CoT), instruction following This section centers on the benchmarks and datasets em- ployed to evaluate the emergent abilities of LLMs. • GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistic
- background the chunk size C is the small fixed constant (More detail in the § D with Algorithm 1). Further, the chunk form of bt and theγ t,i are computed as following Eq.(10) and (11), b[t] = exp(µlog [t] +c log [t] )∈R C,(10) Γ[t] = exp(eAlog [t] )⊙ 1−exp(S log [t] ) ∈R C×C ,(11) where the chunk matrix eAlog [t] ,S log [t] ∈R C×C is computed, (eAlog [t] )ij = ( ¯αlog [t] +c log [t] )i −( ¯µlog [t] )j fori≥j,(12) (Slog [t] )ij = (clog [t] )j−1 −(c log [t] )i fori≥j.(13) These lower triangular matrices
- background with the hope of stimulating further study of test-time behavior of language models. 3.9.1 Arithmetic To test GPT-3's ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: • 2 digit addition (2D+) - The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. "Q: What is 48 plus 76? A: 124." • 2 digit subtr
- background Moreover, to ensure that the whole Mamba block output matches that of Hedgehog at initialization, we also set the parameters of the gate branch and the convolution so that they reduce to the identity operator. Additional details can be found in App. B. Attention scores normalization With the substitution in (7), the SSM mixer outputs Yϕ := ϕMLP(Q)ϕMLP(K)T V . (8) However, the Attention scores in this formula come in an un-normalized fashion. For the Attention scores formulation to more closely
co-cited works
representative citing papers
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
UQ4CT integrates functional-level uncertainty calibration into mixture-of-experts LoRA fine-tuning via a dedicated loss, cutting expected calibration error by over 25% on multiple-choice and generative QA tasks.
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.
Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
citing papers explorer
-
GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.
-
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
-
BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning
BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a commonsense-to-arithmetic continual-learning test on Llama-3.2-3B.
-
QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
QWHA proposes Walsh-Hadamard Transform adapters with adaptive initialization for quantization-aware PEFT, claiming better low-bit accuracy and faster training than low-rank or other FT-based baselines.
-
SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution
SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.
-
RAP: Runtime Adaptive Pruning for LLM Inference
RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
-
Mixtral of Experts
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
-
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.
-
Mistral 7B
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence
Dynamic nested hierarchies let models self-adjust their multi-level optimization structures to support lifelong learning and adaptation to shifting data distributions.
-
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.
-
Gemma 3 Technical Report
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
- Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
- River-LLM: Large Language Model Seamless Exit Based on KV Share
- Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse