pith. machine review for the scientific record.

arxiv: 2106.09685 · v2 · submitted 2021-06-17 · 💻 cs.CL · cs.AI · cs.LG

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Lu Wang, Phillip Wallis, Shean Wang, Weizhu Chen, Yelong Shen, Yuanzhi Li, Zeyuan Allen-Zhu

Pith reviewed 2026-05-08 23:23 UTC · model claude-opus-4-7

classification 💻 cs.CL · cs.AI · cs.LG
keywords parameter-efficient fine-tuning · low-rank adaptation · large language models · transformer adaptation · GPT-3 · intrinsic rank · transfer learning

The pith

Adapting a 175-billion-parameter language model to a new task needs only a rank-1 to rank-8 correction to its weights, trained at a ten-thousandth of the parameter cost of full fine-tuning and with no inference penalty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper makes a specific empirical bet: when you adapt an enormous pretrained language model to a new task, the change you need to make to its weights is, in a precise sense, almost flat — it sits in a subspace of rank one to a handful, even when the underlying weight matrices are more than ten thousand wide. Acting on that bet, the authors freeze the pretrained weights and train only a tiny rank-r factorization BA added on top, then fold BA back into W at deployment. On GPT-3 at 175 billion parameters, this trains ten thousand times fewer parameters than full fine-tuning, uses about a third of the training memory, runs at the same inference speed as the original model, and matches or exceeds full fine-tuning quality on a spread of NLU and NLG benchmarks. The reader should care because this turns "one fine-tuned model per task" into "one frozen base plus many tiny swappable task patches," which changes the deployment economics of large models. The paper also takes the rank-deficiency claim seriously enough to probe it: a subspace-overlap analysis argues that the top singular direction of the learned update already does most of the work, and that ΔW amplifies features already present in W rather than installing new ones.

Core claim

When a large pretrained transformer is adapted to a downstream task, the weight update ΔW it would learn under full fine-tuning lies, empirically, in a very low-dimensional subspace. The paper turns that observation into a method: freeze the pretrained weights W, and learn ΔW only as a product BA of two small matrices of rank r, where r can be as small as 1 to 8 even when the underlying matrix is 12,288 wide. On GPT-3 175B, this trains 10,000× fewer parameters, cuts training VRAM roughly threefold, and at deployment BA is folded back into W so inference cost is identical to the original model. Across RoBERTa, DeBERTa, GPT-2, and GPT-3 on GLUE, E2E, WikiSQL, MNLI, and SAMSum, the low-rank adaptation matches or exceeds full fine-tuning quality.
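To make the headline ratio concrete, here is a back-of-envelope count in Python (not the paper's own accounting), assuming GPT-3 175B's published shape of 96 layers with hidden width 12,288 and LoRA applied only to the query and value projections:

    # Rough trainable-parameter count, assuming 96 layers, d_model = 12288,
    # and LoRA on W_q and W_v only; illustrative, not the paper's exact figures.
    n_layers, d_model = 96, 12288
    full_params = 175e9
    for r in (1, 4, 8):
        # each adapted d_model x d_model matrix gets A (r x d_model) and B (d_model x r)
        lora_params = n_layers * 2 * (2 * d_model * r)
        print(f"r={r}: {lora_params / 1e6:5.1f}M trainable, "
              f"~{full_params / lora_params:,.0f}x fewer than full fine-tuning")

At r = 4 this lands near the quoted 10,000× reduction; r = 1 and r = 8 bracket it.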

What carries the argument

A low-rank additive reparameterization of each adapted weight matrix: W₀ + ΔW is replaced by W₀ + BA, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d,k), with W₀ frozen and B initialized to zero so training starts from the pretrained model. Only A and B receive gradients; at deployment BA is summed into W₀ so the served model has identical shape and latency to the original. In the experiments only the attention query and value projections are adapted, with MLP, LayerNorm, and biases left frozen. A subspace-similarity analysis on the learned A matrices argues that the top singular directions of ΔW at r=8 already overlap strongly with those at r=64, supporting the low-intrinsic-rank claim.
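A minimal PyTorch sketch of that reparameterization, wrapping a plain nn.Linear; the class name LoRALinear and the initialization scale are illustrative assumptions, not the interface of the released loralib package:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Sketch of W0 x + (alpha/r) * B A x with W0 frozen (illustrative only)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)               # W0 stays frozen
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A ~ N(0, sigma^2)
            self.B = nn.Parameter(torch.zeros(d_out, r))         # B = 0, so the update is 0 at step 0
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Only A and B receive gradients, so optimizer state grows with r rather than with the full d×k weight, which is where the training-memory saving described above comes from.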

If this is right

  • A single frozen 175B-parameter base model can serve many tasks at once, with each task carrying only tens of megabytes of low-rank weights that can be hot-swapped without reloading the base.
  • Because BA can be merged into W before deployment, the method introduces no extra inference latency — unlike adapter layers, which add measurable per-token cost in the small-batch online regime (a merge/unmerge sketch follows this list).
  • Training cost drops sharply: optimizer state and gradients are needed only for the small A and B, yielding ~3× lower VRAM and ~25% higher training throughput on GPT-3 175B.
  • The empirical finding that the top singular direction of ΔW already captures most of the useful adaptation suggests fine-tuning is largely amplifying a small set of features already latent in W rather than installing new ones.
  • The framework composes with prefix-tuning and similar input-side methods, so low-rank weight adaptation and prompt adaptation can be stacked rather than chosen between.
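A sketch of the merge and swap step referenced above, reusing the hypothetical LoRALinear from the earlier sketch:

    @torch.no_grad()
    def merge(layer: LoRALinear):
        # fold the low-rank update into the frozen weight for zero-overhead serving
        layer.base.weight += layer.scale * (layer.B @ layer.A)

    @torch.no_grad()
    def unmerge(layer: LoRALinear):
        # subtract it again before hot-swapping in a different task's A and B
        layer.base.weight -= layer.scale * (layer.B @ layer.A)

After merging, one serves layer.base directly and skips the low-rank path, so per-token cost matches the unadapted model; unmerge restores the frozen base so another task's adapter can be loaded.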

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The amplification analysis — where ΔW magnifies directions in W by a factor of order 20 that were present but not emphasized — reframes downstream adaptation as feature reweighting rather than feature acquisition, and predicts that tasks requiring genuinely new knowledge should need larger r.
  • Because adapted task weights are tiny and additive, this method enables a serving architecture where one base model in VRAM hosts thousands of personalized or per-customer task heads, switched per request — a structural change in how fine-tuned models are deployed, not just trained.
  • The choice to adapt only Wq and Wv is heuristic; the same low-rank logic applied to MLP weights, which carry more of the model's parameter mass, may be where remaining gains lie for harder tasks.
  • If the update truly lives in a rank-1 subspace for many tasks, the per-task signature of fine-tuning is essentially a single direction per layer, which is a strong invariant that downstream interpretability and model-editing work could exploit.

Load-bearing premise

That the change a model needs to learn for a new task is small and lives in a few directions — true for tasks close to what the model was pretrained on, but not obviously true when the task demands substantially new knowledge or a very different language.

What would settle it

Run the same protocol on an adaptation that is genuinely far from the pretraining distribution — for example, adapting an English-pretrained model to fluent generation in a low-resource non-Indo-European language, or to a task requiring substantial new factual knowledge rather than reweighting of existing features. If a small r still matches full fine-tuning there, the low-intrinsic-rank claim generalizes; if quality collapses and only large r recovers it, the claim is bounded to near-distribution adaptation.

original abstract

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 7 minor

Summary. The paper proposes LoRA, a parameter-efficient adaptation method for large pre-trained Transformers that freezes the pre-trained weights W and learns a low-rank additive update ΔW = BA with B ∈ R^{d×r}, A ∈ R^{r×k}, r ≪ min(d,k). Only A and B are trained. At deployment BA can be folded into W, so there is no inference-latency penalty relative to full fine-tuning (FT). Empirically, LoRA matches or exceeds FT on GLUE (RoBERTa-base/large, DeBERTa-XXL), on E2E/WebNLG/DART (GPT-2 medium/large), and on WikiSQL/MNLI/SAMSum (GPT-3 175B), while reducing trainable parameters by up to ~10,000× and training VRAM by ~3×. The paper also reports inference-latency overheads of adapter baselines (Table 1, App. B), and offers a §7 "understanding" study arguing that the learned ΔW has low intrinsic rank, that rank as small as 1 suffices for {W_q, W_v} on GPT-3 (Table 6), and that ΔW amplifies directions in W that were not previously emphasized (Table 7).

Significance. If the operational claim holds — and the empirical evidence across four model families and a wide range of tasks is strong — LoRA is a practically important contribution: it removes a real deployment bottleneck for very large models (per-task checkpoint size, optimizer-state memory, task switching) without the inference latency of adapter layers (Table 1, Fig. 5) or the sequence-length overhead of prefix tuning. The method is simple to implement, orthogonal to other PEFT techniques (App. E shows it composes with prefix tuning), and the authors release code and checkpoints, making the result reproducible. Tables 2–4 are consistent in showing LoRA at parity or above FT at 0.1–1% of the trainable-parameter count, which is a strong claim well supported on the cited benchmarks.

major comments (4)
  1. [§7.2–§7.3 (mechanistic claim)] The §7 analysis is offered as evidence that 'the update matrix ΔW could have a very small intrinsic rank' (§7.2) and is invoked in the abstract/§1/§4.1 to explain why LoRA works. However, every ΔW analyzed in §7 is itself produced by a rank-constrained LoRA run. Showing that the top directions of A_{r=8} sit inside A_{r=64} (Fig. 3) demonstrates stability of the LoRA optimization across r, not that the unconstrained adaptation update W_FT − W_0 from full fine-tuning has a rapidly decaying spectrum. Likewise, the 21× amplification factor in Table 7 is computed against U,V from BA's own SVD, which is r-dimensional by construction. To support the mechanistic claim, please report the singular-value spectrum of W_FT − W_0 on at least one task (e.g., MNLI on RoBERTa-large or GPT-2 medium where FT is tractable), and compare its truncation error at rank r against LoRA's task performance at the same r (a sketch of this check follows these comments).
  2. [§4.2, §7.1 (which matrices to adapt)] The decision to adapt only W_q and W_v while freezing MLP weights is presented as a simplicity choice in §4.2 and partially justified post-hoc in §7.1 under an 18M-parameter budget on GPT-3. Since MLP parameters dominate Transformer parameter counts, and several follow-up PEFT works adapt MLPs, the 'attention-only' choice should be validated more carefully. Please add at least one row in Table 5 (and ideally on a smaller model where FT is feasible) where LoRA is also applied to W_{ff1}, W_{ff2}, to test whether the claim 'adapting both W_q and W_v gives the best performance' generalizes beyond attention-only budgets.
  3. [§5.5 / Table 4 (GPT-3 variance)] Headline GPT-3 numbers (e.g., LoRA 74.0 vs. FT 73.8 on WikiSQL; 91.6 vs. 89.5 on MNLI-m; 53.8/29.8/45.9 vs. 52.0/28.0/44.5 on SAMSum) are used to support 'matches or exceeds fine-tuning'. The caption states fluctuation 'around ±0.5%' on WikiSQL and ±0.1% on MNLI-m but does not state how many seeds were run, nor whether FT and LoRA were tuned with comparable hyperparameter budgets. Please specify the number of seeds, the tuning protocol (especially whether FT learning rate was tuned with a comparable budget — Table 12 lists a single LR), and report std rather than a global 'fluctuation' estimate, since several headline gaps (e.g., MNLI-m 91.6 vs. 89.5) are larger than the stated noise but the SAMSum gap is within it.
  4. [§4.1 (scaling α/r and initialization)] The α/r scaling and the asymmetric initialization (A ~ N(0,σ²), B = 0) are stated without ablation. Since the paper claims that 'tuning α is roughly the same as tuning the learning rate' and that this 'reduces the need to retune hyperparameters when we vary r' (citing Yang & Hu, 2021), but Table 6 varies r from 1 to 64 with what appears to be a fixed α, please clarify whether α was held fixed across the r-sweep and provide a small ablation showing sensitivity to α and to the A/B initialization choice. This matters for reproducibility because practitioners frequently report instability when r is changed without α.
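A short sketch of the spectrum check requested in major comment 1, assuming both the pretrained and fully fine-tuned weights for the same matrix are available; the helper name and tensors are hypothetical:

    import torch

    def truncation_error(delta_w: torch.Tensor, ranks=(1, 2, 4, 8, 16, 64)):
        # relative Frobenius error of the best rank-r approximation of dW = W_FT - W_0
        U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
        total = torch.linalg.norm(delta_w)
        errors = {}
        for r in ranks:
            approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
            errors[r] = (torch.linalg.norm(delta_w - approx) / total).item()
        return errors

A fast-decaying error curve at small r, tracking LoRA's task quality at the same r, would be the evidence the comment asks for; a flat curve would undercut the low-intrinsic-rank reading.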
minor comments (7)
  1. [Abstract / §1] The phrase 'no additional inference latency' is technically conditional on merging BA into W (and thus on serving one task at a time per replica). §4.2 acknowledges this for batched multi-task serving. Consider qualifying the abstract sentence accordingly.
  2. [§3 / Table 1] The adapter-latency comparison uses GPT-2 medium on a single GPU. It would be useful to state that this is the worst case for adapters and to point readers to App. B / Fig. 5 for the regime where the gap shrinks, so the framing is balanced.
  3. [§4.1] 'Aghajanyan et al. (2020) shows that the pre-trained language models have a low intrinsic dimension' — please note that their result is about the intrinsic dimension of the loss-landscape reparameterization, not the rank of the weight delta; the analogy to low-rank ΔW is suggestive rather than implied.
  4. [Table 2] The use of † to denote MNLI-initialization vs. pretrained-initialization runs is easy to miss. Consider grouping the † rows visually or splitting the table.
  5. [§7.3 / Table 7] Reporting only Frobenius norms collapses a lot of structure. A small plot of σ_i(U^⊤ W V^⊤) vs. i for ΔW, W-top-r, and Random would make the 'amplifies non-emphasized directions' claim much sharper.
  6. [App. D.4 / Table 12] Only one learning rate per method is listed for GPT-3. State explicitly whether this was the result of a tuning sweep and what the sweep range was; otherwise readers cannot judge whether FT was given a fair shot.
  7. [Typos] §3: 'Infernece' → 'Inference' (Table 1 caption). §4.1: 'instrisic' → 'intrinsic'. §6: 'comtenporary' → 'contemporary'. §6: 'deign' → 'design'.

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for the careful reading and for recommending acceptance. The four major comments are well taken: three of them ask for additional experiments and clearer reporting that strengthen claims we already make, and one identifies a genuine logical gap in §7 that we will address by softening the mechanistic claim and adding a direct spectrum analysis of the full-fine-tuning update on a tractable model. Below we respond point by point, indicate the revisions we will incorporate in v3, and flag the items we can only partially address within the compute budget available for GPT-3.

point-by-point responses
  1. Referee: §7.2–§7.3: the §7 analysis only studies LoRA-produced ΔW, so subspace-overlap and amplification results show stability of the LoRA optimization, not that the unconstrained FT update W_FT − W_0 has a rapidly decaying spectrum. Please report the singular-value spectrum of W_FT − W_0 on a task where FT is tractable.

    Authors: This is a fair criticism and we will address it directly. The referee is correct that Fig. 3 and the φ(A_{r=8}, A_{r=64}) overlap demonstrate consistency of LoRA across r, not low-rankness of the unconstrained FT update, and that the 21× amplification in Table 7 is computed against U,V from BA's own SVD and is therefore r-dimensional by construction. In v3 we will: (i) compute the SVD of ΔW_FT = W_FT − W_0 for W_q and W_v on RoBERTa-large/MNLI and GPT-2 medium/E2E, plot the singular-value spectrum, and report rank-r truncation error ‖ΔW_FT − [ΔW_FT]_r‖_F / ‖ΔW_FT‖_F for r ∈ {1,2,4,8,16,64}; (ii) compare that truncation error to LoRA task performance at the same r; (iii) reword the abstract, §1 and §4.1 so that the low-intrinsic-rank statement is positioned as a hypothesis supported by (a) prior intrinsic-dimension results (Aghajanyan et al., 2020; Li et al., 2018a) and (b) the new FT-spectrum evidence, rather than being inferred from §7 alone. We will also retitle §7 to make clear it characterizes LoRA-learned updates and their relation to W, separately from the mechanistic claim about FT. revision: yes

  2. Referee: §4.2, §7.1: adapting only W_q, W_v is justified only post-hoc under an 18M budget. Please add experiments adapting MLP weights W_ff1, W_ff2 to test whether the 'best to adapt W_q and W_v' conclusion generalizes.

    Authors: We agree this is under-tested in the current draft. For v3 we will extend Table 5 with rows applying LoRA to {W_ff1}, {W_ff2}, {W_ff1, W_ff2}, and {W_q, W_v, W_ff1, W_ff2} at the same 18M budget on GPT-3, and we will add a complete sweep on RoBERTa-large/MNLI and GPT-2 medium/E2E where FT is tractable and we can compare against the FT spectrum on MLP matrices as well. We want to be candid: our conclusion in §7.1 was scoped to attention matrices under a fixed budget, and we did not claim MLP adaptation is useless — we explicitly listed it as future work in §4.2. If the new results show MLP adaptation is competitive or complementary, we will revise the recommendation in §4.2 accordingly rather than retain attention-only as a default. revision: yes

  3. Referee: §5.5/Table 4: please specify number of seeds, tuning protocol, and per-cell std rather than a global 'fluctuation' for the GPT-3 headline numbers; some gaps exceed the stated noise while SAMSum is within it.

    Authors: We will improve the reporting. Concretely: (i) the GPT-3 numbers in Table 4 used 2 seeds per cell for FT and 3 seeds for LoRA, with the global ±-figures derived as the maximum observed std across cells of the same dataset; we will replace this with per-cell mean ± std and state n explicitly. (ii) On the tuning-budget question: as the referee infers from Table 12, FT was tuned with a smaller LR sweep than LoRA because each FT run on GPT-3 175B is roughly an order of magnitude more expensive than a LoRA run; we will state this explicitly so readers can judge the comparison. We are unable to retroactively run a fully matched FT sweep on GPT-3 175B at the scale of the LoRA sweep, and we will be transparent about that asymmetry. (iii) We will soften the wording of headline gaps that fall within the new per-cell std (notably SAMSum) to 'on par' rather than 'exceeds', while retaining the stronger claim where the gap is robust (e.g., MNLI-m). revision: yes

  4. Referee: §4.1: please clarify whether α was held fixed across the Table 6 r-sweep and provide an ablation on α and on the A/B initialization choice.

    Authors: Clarification: in the Table 6 r-sweep, α was set to the first r tried (r=8 for the GPT-3 experiments, giving α=16 in our convention) and held fixed as r varied, exactly as described in §4.1. The α/r scaling means the effective update magnitude is preserved under changes in r, which is why the same learning rate works across the sweep — this is the operational claim we make following Yang & Hu (2021). We agree an explicit ablation will help reproducibility. For v3 we will add: (i) a small grid over α ∈ {r/2, r, 2r, 4r, 8r} at fixed r on RoBERTa-large/MNLI and GPT-2 medium/E2E, showing sensitivity; (ii) an ablation swapping the (A∼N(0,σ²), B=0) initialization for (A=0, B∼N(0,σ²)) and for (both Gaussian, then subtracting initial BA) to verify that the asymmetric init matters chiefly because it makes ΔW=0 at step 0, not because of any asymmetry between A and B per se. Practitioner reports of instability when changing r without α are consistent with our scaling prescription, and we will state the prescription as a recommendation rather than a passing remark. revision: yes
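For concreteness, a sketch of the three initialization variants the response proposes to ablate; the helper is hypothetical and only illustrates that each variant starts the adapted layer exactly at the pretrained function:

    import torch

    def init_lora(d_out, d_in, r, scheme="A_gauss_B_zero", sigma=0.01):
        if scheme == "A_gauss_B_zero":          # the paper's choice: BA = 0 because B = 0
            A, B = torch.randn(r, d_in) * sigma, torch.zeros(d_out, r)
            offset = torch.zeros(d_out, d_in)
        elif scheme == "A_zero_B_gauss":        # swapped asymmetry: BA = 0 because A = 0
            A, B = torch.zeros(r, d_in), torch.randn(d_out, r) * sigma
            offset = torch.zeros(d_out, d_in)
        elif scheme == "both_gauss_subtract":   # both Gaussian; subtract the initial product
            A, B = torch.randn(r, d_in) * sigma, torch.randn(d_out, r) * sigma
            offset = B @ A                      # effective update is BA - offset, i.e. 0 at step 0
        else:
            raise ValueError(scheme)
        return A, B, offset

If the rebuttal's reading is right, all three variants should train comparably once the effective update starts at zero.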

standing simulated objections not resolved
  • A fully matched hyperparameter sweep for full fine-tuning on GPT-3 175B (comparable in size to the LoRA sweep) is beyond our available compute; the asymmetry in tuning budget between FT and LoRA in Table 4 will be acknowledged in v3 but not eliminated.

Circularity Check

2 steps flagged

Operational claim is independent and well-supported; the mechanistic 'ΔW is intrinsically low-rank' explanation in §7 is partially circular because it analyzes LoRA-trained low-rank matrices rather than ΔW from full fine-tuning.

specific steps
  1. fitted input called prediction [Section 7.2, Table 6 and Figure 3]
    "Table 6 shows that, surprisingly, LoRA already performs competitively with a very small r... This suggests the update matrix ∆W could have a very small 'intrinsic rank'. ... Directions corresponding to the top singular vector overlap significantly between Ar=8 and Ar=64, while others do not."

    Both A_{r=8} and A_{r=64} are produced by LoRA training under an explicit rank-r constraint. Showing that the rank-8 solution's top directions are contained in the rank-64 solution does not bear on whether the unconstrained full fine-tuning ΔW has a low-rank structure; it only shows that within the LoRA family, the optimizer doesn't need the extra capacity. The 'intrinsic rank of ΔW' inference is partly circular because the object measured is already constrained to be low-rank by construction.

  2. fitted input called prediction [Section 7.3, Table 7]
    "We project W onto the r-dimensional subspace of ∆W by computing U⊤W V⊤, with U/V being the left/right singular-vector matrix of ∆W. ... ∆W has a stronger correlation with W compared to a random matrix, indicating that ∆W amplifies some features that are already in W. ... the amplification factor is rather huge: 21.5 ≈ 6.91/0.32 for r = 4."

    ΔW here is LoRA's BA, whose column space is r-dimensional by construction. The 21× amplification factor shows LoRA selects a small subspace not emphasized in W, but it cannot establish that full fine-tuning would select the same low-dimensional subspace, because no full-FT ΔW is computed for comparison. The 'rank-deficiency in language model adaptation' framing thus partly reduces to a property of the LoRA parameterization itself.
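A sketch of the Table 7 quantity as the quoted passage describes it, on hypothetical tensors; it reproduces the ratio computation, not the paper's exact pipeline:

    import torch

    def amplification_factor(delta_w: torch.Tensor, w: torch.Tensor, r: int):
        # project W onto the top-r singular subspace of dW and compare Frobenius norms;
        # e.g. 6.91 / 0.32 ≈ 21.5 for r = 4 in the paper's Table 7
        U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
        projected = U[:, :r].T @ w @ Vh[:r, :].T      # r x r projection of W
        return (torch.linalg.norm(S[:r]) / torch.linalg.norm(projected)).item()

Because U and V come from the SVD of ΔW = BA itself, the projection is r-dimensional by construction, which is exactly the dependence flagged above.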

full rationale

The paper's central operational claim — that a rank-r additive update matches full fine-tuning on a range of NLU/NLG benchmarks — is supported by external benchmark numbers (Tables 2–4) and is not circular: the comparison is against published full-FT baselines on GLUE accuracy, BLEU, ROUGE, SQL accuracy, all independent of LoRA's parameterization. That part stands on its own. The mechanistic strand — repeated in the abstract ('empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA'), §1, §4.1, and §7 — has a bounded circularity issue. The §7.2 evidence (Table 6, Fig. 3) that 'a rank as small as one suffices' compares LoRA solutions at r=8 vs r=64; both are constrained to be low-rank by construction, so containment of A_{r=8}'s top directions inside A_{r=64} does not establish that the unconstrained ΔW = W_FT − W_0 has a fast-decaying spectrum. Similarly §7.3 / Table 7 reports a 21× amplification factor on the SVD basis of LoRA's BA — a basis r-dimensional by construction. The paper phrases these as statements about the LoRA update ('the update matrix ΔW could have a very small intrinsic rank') and cites Aghajanyan et al. (2020) as motivation rather than proof, so this is not a self-citation chain. The Aghajanyan citation is external (different authorship) and concerns intrinsic dimension of the optimization landscape, which is related to but not equivalent to rank of the weight delta — using it as motivation rather than as a load-bearing theorem is appropriate. No load-bearing claim rests on a same-author uniqueness theorem; comparisons use third-party baselines (Houlsby, Pfeiffer, Li & Liang) at matched parameter counts. Overall: minor circularity localized to the §7 mechanistic interpretation, not to the headline result. Score 3.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

LoRA introduces few free parameters of its own beyond the rank r and scaling α (and the architectural choice of which matrices to adapt). The method does not postulate new entities; it reparameterizes an existing object (the weight delta) under a low-rank constraint. The principal axiom is the empirical hypothesis that ΔW for adaptation is approximately low-rank on tasks close to the pretraining distribution.

pith-pipeline@v0.9.0 · 9739 in / 6081 out tokens · 96850 ms · 2026-05-08T23:23:20.909638+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  3. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  4. Efficient and Adaptive Human Activity Recognition via LLM Backbones

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained LLMs adapted via convolutional projections and LoRA act as efficient frozen backbones for sensor-based human activity recognition, delivering strong data efficiency and cross-dataset transfer.

  5. Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

    cs.AI 2026-05 unverdicted novelty 7.0

    AWARE augments generative next-POI recommendation with LLM agents that produce user-anchored narratives capturing events, culture, and trends, delivering up to 12.4% relative gains on three real datasets.

  6. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  7. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

    cs.LG 2026-05 unverdicted novelty 7.0

    ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

  8. Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

    cs.IR 2026-05 unverdicted novelty 7.0

    Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

  9. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  10. MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 7.0

    MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.

  11. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  12. Dataset Watermarking for Closed LLMs with Provable Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...

  13. Rethinking Vacuity for OOD Detection in Evidential Deep Learning

    cs.AI 2026-05 accept novelty 7.0

    Vacuity-based OOD detection in evidential deep learning is highly sensitive to class cardinality differences between ID and OOD, which can artificially inflate AUROC and AUPR without any change in model predictions.

  14. A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

    cs.LG 2026-05 unverdicted novelty 7.0

    FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

  15. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  16. TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...

  17. A foundation model of vision, audition, and language for in-silico neuroscience

    q-bio.NC 2026-05 unverdicted novelty 7.0

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  18. VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

    cs.CR 2026-05 unverdicted novelty 7.0

    VulKey reaches 31.5% repair accuracy on real C/C++ vulnerabilities by matching hierarchical expert patterns to guide LLM patch generation, beating prior baselines by 7.6%.

  19. Act2See: Emergent Active Visual Perception for Video Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

  20. Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

    cs.CL 2026-05 unverdicted novelty 7.0

    DSR uses transformer models to detect sentiment targets in text and score them along three theory-motivated axes, with validation showing correlations to existing social science datasets.

  21. Subliminal Steering: Stronger Encoding of Hidden Signals

    cs.CL 2026-04 unverdicted novelty 7.0

    Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

  22. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  23. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  24. Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

    cs.LG 2026-04 unverdicted novelty 7.0

    High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...

  25. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  26. LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

    cs.CL 2026-04 conditional novelty 7.0

    Fine-tuned BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1 on LegalBench-BR, outperforming commercial LLMs by 22-28 points and eliminating their systematic bias toward civil law on Brazilian legal classification.

  27. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  28. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

    cs.IR 2026-04 conditional novelty 7.0

    Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.

  29. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  30. BasketHAR: A Multimodal Dataset for Human Activity Recognition and Sport Analysis in Basketball Training Scenarios

    cs.CV 2026-04 conditional novelty 7.0

    BasketHAR is a publicly released multimodal dataset of professional basketball training activities captured with inertial sensors, physiological signals, and video, accompanied by a baseline alignment method.

  31. RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    cs.CL 2026-04 unverdicted novelty 7.0

    RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

  32. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  33. DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

    cs.CV 2026-04 unverdicted novelty 7.0

    DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baseline...

  34. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  35. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  36. Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS

    eess.AS 2026-04 unverdicted novelty 7.0

    Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.

  37. SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

    cs.LG 2026-04 unverdicted novelty 7.0

    LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.

  38. FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

  39. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  40. Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

    cs.CV 2026-04 unverdicted novelty 7.0

    Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing...

  41. DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

    eess.IV 2026-04 unverdicted novelty 7.0

    DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.

  42. HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

  43. Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    A LoRA-fine-tuned Qwen 3.5 2B model for task-conditioned tool-output pruning reaches 0.86 recall and 0.80 F1 on a new 618-example test set while removing 92% of input tokens and outperforming larger zero-shot models.

  44. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  45. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  46. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  47. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  48. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  49. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  50. Extending Context Window of Large Language Models via Positional Interpolation

    cs.CL 2023-06 conditional novelty 7.0

    Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...

  51. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  52. Combining pre-trained models via localized model averaging

    stat.ME 2026-05 unverdicted novelty 6.0

    Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

  53. LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters

    cs.CR 2026-05 unverdicted novelty 6.0

    LoREnc secures foundation models and adapters by truncating dominant low-rank components and compensating only in authorized adapters, causing unauthorized outputs to collapse while authorized performance remains exact.

  54. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  55. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  56. Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Summing outputs from separately trained QLoRA PEFT modules provides strong performance for attribute-controlled text generation, often matching or exceeding single-task modules even on single-attribute tests.

  57. EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    A dual-branch system using frequency edge cues and CLIP-based synthetic patch detection for accurate, resolution-independent image forgery localization.

  58. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

  59. GRAFT: Graph-Tokenized LLMs for Tool Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.

  60. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 241 Pith papers · 4 internal anchors

  1. [1]

    Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

    Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255 [cs], December 2020. URL http://arxiv.org/abs/2012.13255

  2. [2]

    What Can ResNet Learn Efficiently, Going Beyond Kernels? In NeurIPS, 2019

    Zeyuan Allen-Zhu and Yuanzhi Li. What Can ResNet Learn Efficiently, Going Beyond Kernels? In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1905.10337

  3. [3]

    Backward feature correction: How deep learning performs deep learning

    Zeyuan Allen-Zhu and Yuanzhi Li. Backward feature correction: How deep learning performs deep learning. arXiv preprint arXiv:2001.04413, 2020 a

  4. [4]

    Feature purification: How adversarial training performs robust deep learning

    Zeyuan Allen-Zhu and Yuanzhi Li. Feature purification: How adversarial training performs robust deep learning. arXiv preprint arXiv:2005.10190, 2020 b

  5. [5]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962

  6. [6]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016

  7. [7]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  8. [8]

    A singular value thresholding algorithm for matrix completion

    Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010

  9. [9]

    SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017. doi:10.18653/v1/s17-2001. URL http://dx.doi.org/10.18653/v1/S17-2001

  10. [10]

    A unified architecture for natural language processing: deep neural networks with multitask learning

    Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning , ICML '08, pp.\ 160--167, New York, NY, USA, July 2008. Association for Computing Machinery. ISBN 978-1-60558-205-4. doi:10.1145/1390156.1390177. UR...

  11. [11]

    Predicting parameters in deep learning, 2014

    Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning, 2014

  12. [12]

    Bert: Pre-training of deep bidirectional transformers for language understanding, 2019 a

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019 a

  13. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs], May 2019b. URL http://arxiv.org/abs/1810.04805

  14. [14]

    Automatically constructing a corpus of sentential paraphrases

    William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP 2005), 2005. URL https://aclanthology.org/I05-5002

  15. [15]

    The webnlg challenge: Generating text from rdf data

    Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. The webnlg challenge: Generating text from rdf data. In Proceedings of the 10th International Conference on Natural Language Generation, pp.\ 124--133, 2017

  16. [16]

    When do neural networks outperform kernel methods? arXiv preprint arXiv:2006.13409, 2020

    Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? arXiv preprint arXiv:2006.13409, 2020

  17. [17]

    Samsum corpus: A human- annotated dialogue dataset for abstractive summarization

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. CoRR, abs/1911.12237, 2019. URL http://arxiv.org/abs/1911.12237

  18. [18]

    A literature survey of low-rank tensor approximation techniques

    Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013

  19. [19]

    Jihun Ham and Daniel D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, pp.\ 376--383, 2008. URL https://doi.org/10.1145/1390156.1390204

  20. [20]

    WARP: Word-level Adversarial ReProgramming

    Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. WARP: Word-level Adversarial ReProgramming. arXiv:2101.00121 [cs], December 2020. URL http://arxiv.org/abs/2101.00121

  21. [21]

    Deberta: Decoding-enhanced bert with disentangled attention, 2021

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention, 2021

  22. [22]

    Parameter-Efficient Transfer Learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs, stat], June 2019. URL http://arxiv.org/abs/1902.00751

  23. [23]

    Speeding up convolutional neural networks with low rank expansions

    Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014

  24. [24]

    Initialization and regularization of factorized neural layers, 2021

    Mikhail Khodak, Neil Tenenholtz, Lester Mackey, and Nicolò Fusi. Initialization and regularization of factorized neural layers, 2021

  25. [25]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

  26. [26]

    Gshard: Scaling giant models with conditional computation and automatic sharding, 2020

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding, 2020

  27. [27]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv:2104.08691 [cs], April 2021. URL http://arxiv.org/abs/2104.08691

  28. [28]

    Measuring the intrinsic dimension of objective landscapes

    Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the Intrinsic Dimension of Objective Landscapes. arXiv:1804.08838 [cs, stat], April 2018a. URL http://arxiv.org/abs/1804.08838

  29. [29]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190 [cs], January 2021. URL http://arxiv.org/abs/2101.00190

  30. [30]

    Learning overparameterized neural networks via stochastic gradient descent on structured data

    Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018

  31. [31]

    Recovery guarantee of weighted low-rank approximation via alternating minimization

    Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee of weighted low-rank approximation via alternating minimization. In International Conference on Machine Learning, pp.\ 2358--2367. PMLR, 2016

  32. [32]

    Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations

    Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp.\ 2--47. PMLR, 2018 b

  33. [33]

    Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning

    Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.\ 441--459, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.41. URL https://aclanthology...

  34. [34]

    GPT understands, too

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT Understands, Too. arXiv:2103.10385 [cs], March 2021. URL http://arxiv.org/abs/2103.10385

  35. [35]

    Roberta: A robustly optimized bert pretraining approach, 2019

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

  36. [36]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  37. [37]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

  38. [38]

    Compacter: Efficient low-rank hypercomplex adapter layers, 2021

    Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers, 2021

  39. [39]

    Dart: Open-domain structured data record to text generation

    Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, et al. Dart: Open-domain structured data record to text generation. arXiv preprint arXiv:2007.02871, 2020

  40. [40]

    The e2e dataset: New challenges for end-to-end generation

    Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The e2e dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254, 2017

  41. [41]

    Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian

    Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. arXiv preprint arXiv:1906.05392, 2019

  42. [42]

    Adapterfusion: Non-destructive task composition for transfer learning, 2021

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning, 2021

  43. [43]

    Semi-orthogonal low-rank matrix factorization for deep neural networks

    Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Interspeech, pp. 3743–3747, 2018

  44. [44]

    Improving Language Understanding by Generative Pre-Training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. pp. 12, a

  45. [45]

    Language Models are Unsupervised Multitask Learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. pp. 24, b

  46. [46]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for squad. CoRR, abs/1806.03822, 2018. URL http://arxiv.org/abs/1806.03822

  47. [47]

    Learning multiple visual domains with residual adapters

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. arXiv:1705.08045 [cs, stat], November 2017. URL http://arxiv.org/abs/1705.08045

  48. [48]

    Adapterdrop: On the efficiency of adapters in transformers, 2020

    Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. Adapterdrop: On the efficiency of adapters in transformers, 2020

  49. [49]

    Low-rank matrix factorization for deep neural network training with high-dimensional output targets

    Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659. IEEE, 2013

  50. [50]

    Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

  51. [51]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics

  52. [52]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010, 2017

  53. [53]

    Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019

  54. [54]

    Superglue: A stickier benchmark for general-purpose language understanding systems, 2020

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020

  55. [55]

    Neural network acceptability judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018

  56. [56]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101

  57. [57]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  58. [58]

    Feature Learning in Infinite-Width Neural Networks

    Greg Yang and Edward J. Hu. Feature Learning in Infinite-Width Neural Networks. arXiv:2011.14522 [cond-mat], May 2021. URL http://arxiv.org/abs/2011.14522

  59. [59]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2021

    Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2021

  60. [60]

    Extracting deep neural network bottleneck features using low-rank matrix factorization

    Yu Zhang, Ekapol Chuangsuwanich, and James Glass. Extracting deep neural network bottleneck features using low-rank matrix factorization. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 185–189. IEEE, 2014

  61. [61]

    Low-rank plus diagonal adaptation for deep neural networks

    Yong Zhao, Jinyu Li, and Yifan Gong. Low-rank plus diagonal adaptation for deep neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5005–5009. IEEE, 2016

  62. [62]

    Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

    Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017. URL http://arxiv.org/abs/1709.00103
