Pith · machine review for the scientific record

arxiv: 2604.19488 · v1 · submitted 2026-04-21 · 💻 cs.AI

Recognition: unknown

CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords: cross-domain adaptation · chain-of-thought · large language models · logical reasoning · knowledge transfer · maximum mean discrepancy · in-context learning · domain shift

The pith

A lightweight adapter aligns latent reasoning patterns across domains in LLMs by distilling CoT-enriched representations and matching them with MMD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models perform well on logical reasoning when given good in-domain examples but often lack such examples in specialized or low-resource fields. Cross-domain samples can be retrieved as substitutes, yet domain shift usually blocks the model from extracting and using shared reasoning structures. CoDA addresses this by inserting a lightweight adapter that operates on intermediate hidden states rather than raw text prompts. The adapter distills chain-of-thought enriched reference features from the source domain and applies maximum mean discrepancy to bring the source and target latent distributions into closer alignment. Experiments across multiple reasoning tasks and model families show this yields large gains over prior cross-domain prompting baselines.
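The mechanism described above can be sketched in a few lines. Everything here is illustrative: the toy dimensions, the one-layer tanh adapter, and the 0.5 trade-off weight are assumptions for the sketch, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def adapter(h, W, b):
    """Lightweight non-linear adapter acting on hidden states (a sketch)."""
    return np.tanh(h @ W + b)

# Toy hidden states: source, CoT-enriched teacher, and shifted target (dim 8).
h_S = rng.normal(size=(32, 8))
h_star_S = h_S + 0.1 * rng.normal(size=(32, 8))  # stand-in for the CoT-infused teacher state
h_T = rng.normal(loc=0.5, size=(32, 8))          # domain shift as a mean offset

W = rng.normal(scale=0.1, size=(8, 8))
b = np.zeros(8)
z_S, z_T = adapter(h_S, W, b), adapter(h_T, W, b)

# Dual objective: MSE distillation toward the teacher state
# plus MMD alignment of source and target adapted representations.
mse_loss = ((z_S - h_star_S) ** 2).mean()
mmd_loss = gaussian_mmd2(z_S, z_T, bandwidth=1.0)
total_loss = mse_loss + 0.5 * mmd_loss  # 0.5 is a hypothetical trade-off weight
```

In a real training loop the adapter parameters would be optimized against this combined loss while the LLM stays frozen; the sketch only evaluates the objective once.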

Core claim

CoDA employs a lightweight adapter to directly intervene in the intermediate hidden states of LLMs. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy for kernelized distribution matching, the method aligns the latent reasoning representation of the source and target domains, allowing more effective transfer of cross-domain knowledge for logical reasoning.

What carries the argument

Lightweight adapter that intervenes in intermediate hidden states through CoT-enriched feature distillation plus MMD-based distribution matching to align source and target reasoning representations.

If this is right

  • Cross-domain samples become usable as surrogate in-context demonstrations for logical reasoning tasks.
  • Performance improves substantially in expertise-scarce domains such as certain scientific or legal fields.
  • The gains hold across different LLM families on multiple reasoning benchmarks.
  • The method outperforms previous state-of-the-art cross-domain baselines by a large margin.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Reasoning patterns may be partially separable from surface domain content at the level of hidden states.
  • Similar hidden-state adapters could be tested on non-reasoning transfer tasks such as generation or classification.
  • The approach raises the possibility that targeted interventions inside the model can substitute for ever-larger context windows or retrieval systems.

Load-bearing premise

Pronounced domain shift between source and target can be overcome by aligning latent reasoning patterns that exist in intermediate hidden states and can be captured via CoT distillation and MMD.

What would settle it

If applying the CoDA adapter produces no measurable reduction in domain discrepancy metrics or no accuracy gains over raw cross-domain prompting on held-out logical reasoning benchmarks, the alignment approach would be shown ineffective.
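As a toy illustration of that falsification test, a kernel two-sample MMD estimate should drop once representations are aligned. Here a simple mean-shift stands in for the adapter's effect; that stand-in, and all shapes and constants, are assumptions for the sketch, not CoDA itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = rng.normal(size=(64, 4))
tgt = rng.normal(loc=1.0, size=(64, 4))    # pronounced domain shift

before = gaussian_mmd2(src, tgt)
aligned = tgt - tgt.mean(0) + src.mean(0)  # mean-shift stand-in for alignment
after = gaussian_mmd2(src, aligned)
assert after < before  # the discrepancy metric must drop for the premise to hold
```

If an adapter produced no such reduction (and no accuracy gain), the alignment story would fail exactly this kind of check.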

Figures

Figures reproduced from arXiv: 2604.19488 by Buzhou Tang, Dongning Sun, Jianzhi Yan, Le Liu, Yang Xiang, Zhiming Li.

Figure 1
Figure 1: The overall framework of CoDA. (A) Adapted Representation Extraction: a frozen LLM extracts the original hidden states (h_S, h_T) and the CoT-infused teacher state (h*_S). A lightweight non-linear adapter then maps these states into a shared reasoning-enhanced space. (B) CoT-guided Domain Adaptation: the adapter is optimized with a dual objective: an MSE loss distills the reasoning trajectory by m…
Figure 3
Figure 3: t-SNE visualization of latent representations. Compared to the unaligned …
Figure 4
Figure 4: Parameter efficiency vs. OOD accuracy. Gray arrows highlight …
Figure 5
Figure 5: Layer-wise intervention performance. Accuracy of latent intervention …
read the original abstract

Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoDA, a cross-domain adaptation technique for LLMs on logical reasoning tasks. It introduces a lightweight adapter that intervenes directly on intermediate hidden states, combining feature-based distillation of Chain-of-Thought (CoT) enriched source representations with Maximum Mean Discrepancy (MMD) kernel matching to align latent reasoning distributions between source and target domains. The central claim is that this alignment enables robust transfer of underlying reasoning patterns, yielding large-margin gains over prior SOTA baselines across multiple tasks and model families in low-resource domains lacking in-domain exemplars.

Significance. If the alignment mechanism demonstrably transfers step-by-step logical structures rather than merely reducing surface-level distributional mismatch, the work could meaningfully extend in-context learning to expertise-scarce domains such as emerging biomedical or niche legal fields, moving beyond retrieval-based prompting.

major comments (2)
  1. [Abstract] Abstract: The assertion that CoDA 'significantly outperform[s] the previous state-of-the-art baselines by a large margin' supplies no quantitative metrics, baseline names, task-specific results, ablation controls, or statistical significance tests, making the central empirical claim impossible to evaluate from the summary alone and requiring full experimental tables for assessment.
  2. [Method] Method description (abstract and implied §3): The claim that MMD on CoT-distilled hidden states 'aligns the latent reasoning representation' rests on the untested assumption that shared reasoning structures exist and are capturable at the level of kernelized activation matching; without ablations isolating the CoT distillation component, error analysis on reasoning-step fidelity, or probing for logical transfer (vs. generic regularization), it remains unclear whether gains reflect reasoning alignment or superficial statistic matching.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'CoT-enriched reference representations' is used without an inline definition or citation to the specific Chain-of-Thought prompting protocol employed for the source references.
  2. [Abstract] Abstract: 'Large margin' is repeated without anchoring to concrete effect sizes or comparison tables that would appear in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, clarifying the manuscript's contributions and outlining targeted revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that CoDA 'significantly outperform[s] the previous state-of-the-art baselines by a large margin' supplies no quantitative metrics, baseline names, task-specific results, ablation controls, or statistical significance tests, making the central empirical claim impossible to evaluate from the summary alone and requiring full experimental tables for assessment.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version, we will incorporate specific metrics (e.g., average accuracy gains and baseline names across the main tasks) while referencing the full results tables and statistical tests in Section 4. This change will make the central claim directly evaluable from the abstract without exceeding length constraints. revision: yes

  2. Referee: [Method] Method description (abstract and implied §3): The claim that MMD on CoT-distilled hidden states 'aligns the latent reasoning representation' rests on the untested assumption that shared reasoning structures exist and are capturable at the level of kernelized activation matching; without ablations isolating the CoT distillation component, error analysis on reasoning-step fidelity, or probing for logical transfer (vs. generic regularization), it remains unclear whether gains reflect reasoning alignment or superficial statistic matching.

    Authors: The manuscript already contains ablations in Section 4.3 that isolate the CoT distillation component, demonstrating that its removal leads to substantial performance drops relative to the full CoDA model. Qualitative examples of transferred reasoning chains are also provided in the appendix. We acknowledge, however, that additional quantitative probing for step-level logical fidelity versus generic regularization would provide stronger evidence. We will add a new analysis subsection with error categorization on reasoning steps and a controlled comparison against a non-CoT MMD baseline. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture with no self-referential derivations or load-bearing self-citations

full rationale

The paper presents CoDA as a practical method combining a lightweight adapter, feature-based distillation of CoT-enriched representations, and MMD for distribution alignment. No equations, derivations, or parameter-fitting steps are described that reduce the claimed cross-domain transfer gains to inputs by construction. The abstract and method description treat the components as independent design choices validated empirically, without invoking uniqueness theorems, renaming known results, or self-citation chains for the core premise. This matches the reader's assessment of an empirical architecture lacking self-referential structure. The central claim remains open to external validation via experiments rather than being tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on learned adapter parameters and the assumption of transferable latent structures; no new physical entities or ungrounded constants are introduced.

free parameters (2)
  • adapter parameters
    Lightweight module weights learned to intervene in hidden states; fitted during training.
  • MMD kernel hyperparameters
    Bandwidth and kernel choice for distribution matching; selected to align source-target representations.
axioms (1)
  • domain assumption: source and target domains share latent reasoning patterns that can be aligned in hidden states
    Invoked to justify why MMD and CoT distillation will transfer knowledge effectively.
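For the bandwidth hyperparameter listed above, a common default in MMD practice is the median heuristic over pooled pairwise distances. The paper's actual selection rule is not stated in this summary, so the following is only a sketch of that conventional choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def median_heuristic_bandwidth(x, y):
    """Median pairwise distance over the pooled sample: a standard
    default for the Gaussian-kernel bandwidth in MMD."""
    z = np.concatenate([x, y], axis=0)
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

x = rng.normal(size=(20, 4))
y = rng.normal(loc=0.5, size=(20, 4))
bw = median_heuristic_bandwidth(x, y)
```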

pith-pipeline@v0.9.0 · 5576 in / 1167 out tokens · 34218 ms · 2026-05-10T02:07:22.281093+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Donget al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, vol. 1, no. 2, pp. 1–124, 2023

  2. [2]

    Mathprompter: Mathematical reasoning using large language models,

    S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical reasoning using large language models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), 2023, pp. 37–42

  3. [3]

    Towards reasoning in large language models: A survey,

    J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” inFindings of the association for computational linguistics: ACL 2023, 2023, pp. 1049–1065

  4. [4]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  5. [5–6]

    Towards robust deep reinforcement learning-based quantitative trading with neuro-symbolic trend analysis,

    J. Jiang, Z. Li, A. Cui, X. Zhou, B. Li, and D. Sun, “Towards robust deep reinforcement learning-based quantitative trading with neuro-symbolic trend analysis,” Neural Networks, vol. 201, p. 108924. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608026003850

  7. [7]

    Investigating training data detection in ai coders,

    T. Li, Y . Wei, Z. Li, A. Liu, Q. Guo, X. Liu, D. Sun, and Y . Liu, “Investigating training data detection in ai coders,” 2025. [Online]. Available: https://arxiv.org/abs/2507.17389

  8. [8]

    Logic-q: Improving deep reinforcement learning-based quantitative trading via program sketch-based tuning,

    Z. Li, J. Jiang, Y . Cao, A. Cui, B. Wu, B. Li, Y . Liu, and D. D. Sun, “Logic-q: Improving deep reinforcement learning-based quantitative trading via program sketch-based tuning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 17, pp. 18 584–18 592, Apr. 2025. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/ article/v...

  9. [9]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  10. [10]

    In-context learning with retrieved demonstrations for language models: A survey,

    M. Luo, X. Xu, Y . Liu, P. Pasupat, and M. Kazemi, “In-context learning with retrieved demonstrations for language models: A survey,”arXiv preprint arXiv:2401.11624, 2024

  11. [11]

    Learning to retrieve prompts for in-context learning,

    O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” inProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies, 2022, pp. 2655–2671

  12. [12]

    Learning to retrieve in-context examples for large language models,

    L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context examples for large language models,” inProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1752–1767

  13. [13]

    Uncertainty quantification for in-context learning of large language models,

    C. Ling, X. Zhao, X. Zhang, W. Cheng, Y . Liu, Y . Sun, M. Oishi, T. Osaki, K. Matsuda, J. Jiet al., “Uncertainty quantification for in-context learning of large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, p...

  14. [14]

    Reasoning graph enhanced exemplars retrieval for in-context learning,

    Y . Lin, B. Zhong, S. Jiang, J. Siebert, and Q. Chen, “Reasoning graph enhanced exemplars retrieval for in-context learning,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 9737–9759

  15. [15–16]

    Improving neural logic machines via failure reflection,

    Z. Li, Y. Cao, Y. Zheng, X. Liu, B. Wu, T. Li, X. Xu, J. Jiang, Y. S. Teo, S.-W. Lin, and Y. Liu, “Improving neural logic machines via failure reflection,” in Forty-first International Conference on Machine Learning. [Online]. Available: https://openreview.net/forum?id=JObct1zyTb

  17. [17]

    Fedmut: Generalized federated learning via stochastic mutation,

    M. Hu, Y . Cao, A. Li, Z. Li, C. Liu, T. Li, M. Chen, and Y . Liu, “Fedmut: Generalized federated learning via stochastic mutation,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, pp. 12 528–12 537, Mar. 2024. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/29146

  18. [18]

    Learning program semantics for vulnerability detection via vulnerability-specific inter-procedural slicing,

    B. Wu, S. Liu, Y . Xiao, Z. Li, J. Sun, and S.-W. Lin, “Learning program semantics for vulnerability detection via vulnerability-specific inter-procedural slicing,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023. New York, NY , USA: Association for C...

  19. [19]

    FAIRER: Fairness as decision rationale alignment,

    T. Li, Q. Guo, A. Liu, M. Du, Z. Li, and Y . Liu, “FAIRER: Fairness as decision rationale alignment,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 19 471–19 489. [On...

  20. [20]

    Reason Analogically via Cross-domain Prior Knowledge: An Empirical Study of Cross-domain Knowledge Transfer for In-Context Learning

    L. Liu, Z. Li, J. Yan, Z. Yuan, S. Chen, Y . Pan, B. Tang, Q. Chen, Y . Xiang, and D. D. Sun, “Reason analogically via cross-domain prior knowledge: An empirical study of cross-domain knowledge transfer for in-context learning,” 2026. [Online]. Available: https://arxiv.org/abs/2604.05396

  21. [21]

    The shape of reasoning: Topological analysis of reasoning traces in large language models

    X. W. Tan, N. Tan, G. Lee, and S. Kok, “The shape of reasoning: Topological analysis of reasoning traces in large language models,”arXiv preprint arXiv:2510.20665, 2025

  22. [22]

    Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval

    J. Yan, Z. Li, L. Liu, Z. Yuan, S. Chen, Y . Pan, B. Tang, Y . Xiang, and D. D. Sun, “Towards effective in-context cross-domain knowledge transfer via domain-invariant-neurons-based retrieval,” 2026. [Online]. Available: https://arxiv.org/abs/2604.05383

  23. [23]

    Large language models can be lazy learners: Analyze shortcuts in in-context learning,

    R. Tang, D. Kong, L. Huanget al., “Large language models can be lazy learners: Analyze shortcuts in in-context learning,” inFindings of the association for computational linguistics: ACL 2023, 2023, pp. 4645–4657

  24. [24]

    Exploring and mitigating shortcut learning for generative large language models,

    Z. Sun, Y . Xiao, J. Li, Y . Ji, W. Chen, and M. Zhang, “Exploring and mitigating shortcut learning for generative large language models,” in Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), 2024, pp. 6883–6893

  25. [25]

    Examining the robustness of llm evaluation to the distributional assumptions of benchmarks,

    C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of llm evaluation to the distributional assumptions of benchmarks,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 10 406–10 421

  26. [26]

    Graph of thoughts: Solving elaborate problems with large language models,

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyket al., “Graph of thoughts: Solving elaborate problems with large language models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690

  27. [27]

    Demystifying chains, trees, and graphs of thoughts,

    M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, G. Piao, N. Blach, P. Nyczyk, M. Copik, G. Kwa ´sniewski, J. Mülleret al., “Demystifying chains, trees, and graphs of thoughts,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  28. [28]

    Enhanced data synthesis for llm through reasoning structures generated by hierarchical gflownet,

    T. Bu, M. Zhang, H. Duan, S. Li, L. Hu, and Y . Li, “Enhanced data synthesis for llm through reasoning structures generated by hierarchical gflownet,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 15 931–15 958

  29. [29]

    Unlocking the power of llm uncertainty for active in-context example selection,

    H.-Y. Huang, Z. Wu, Y. Yang, J. Zhang, and Y. Wu, “Unlocking the power of llm uncertainty for active in-context example selection,” arXiv preprint arXiv:2408.09172, 2024

  30. [30]

    Using natural language explanations to improve robustness of in-context learning,

    X. He, Y . Wu, O.-M. Camburu, P. Minervini, and P. Stenetorp, “Using natural language explanations to improve robustness of in-context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13 477– 13 499

  31. [31]

    A kernel two-sample test,

    A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,”The journal of machine learning research, vol. 13, no. 1, pp. 723–773, 2012

  32. [32]

    A survey on transfer learning,

    S. J. Pan and Q. Yang, “A survey on transfer learning,”IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009

  33. [33]

    A survey of multi-source domain adaptation,

    S. Sun, H. Shi, and Y . Wu, “A survey of multi-source domain adaptation,” Information Fusion, vol. 24, pp. 84–92, 2015

  34. [34]

    A brief review of domain adaptation,

    A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, “A brief review of domain adaptation,” Advances in Data Science and Information Engineering: proceedings from ICDATA 2020 and IKE 2020, pp. 877–894, 2021

  35. [35]

    Cot vectors: Transferring and probing the reasoning mechanisms of llms,

    L. Li, Z. Wang, Y. Wu, J. Cai, and X. Yang, “Cot vectors: Transferring and probing the reasoning mechanisms of llms,” arXiv preprint arXiv:2510.00579, 2025

  36. [36]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakanoet al., “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021

  37. [37]

    Non-interactive symbolic-aided chain-of-thought for logical reasoning,

    P. M. Nguyen, T. Dang, and N. Inoue, “Non-interactive symbolic-aided chain-of-thought for logical reasoning,” inProceedings of the 39th Pacific Asia Conference on Language, Information and Computation, 2025, pp. 329–340

  38. [38]

    Folio: Natural language reasoning with first-order logic,

    S. Han, H. Schoelkopf, Y . Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y . Qiao, L. Bensonet al., “Folio: Natural language reasoning with first-order logic,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 22 017– 22 031

  39. [39]

    Proofwriter: Generating implications, proofs, and abductive statements over natural language,

    O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: Generating implications, proofs, and abductive statements over natural language,” inFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3621–3634

  40. [40]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge,

    A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4149–4158

  41. [41–42]

    Gemma 3 technical report,

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, ... [Online]. Available: https://arxiv.org/abs/2503.19786

  43. [43]

    Qwen2.5: A party of foundation models,

    Q. Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

  44. [44]

    The probabilistic relevance framework: BM25 and beyond

    S. Robertson and H. Zaragoza, The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009, vol. 4

  45. [45]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  46. [46]

    Revisiting demonstration selection strategies in in-context learning,

    K. Peng, L. Ding, Y . Yuan, X. Liu, M. Zhang, Y . Ouyang, and D. Tao, “Revisiting demonstration selection strategies in in-context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9090– 9101

  47. [47]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” Iclr, vol. 1, no. 2, p. 3, 2022

  48. [48]

    P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,

    X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 61–68

  49. [49]

    Steering Llama 2 via Contrastive Activation Addition

    N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner, “Steering llama 2 via contrastive activation addition,”arXiv preprint arXiv:2312.06681, 2023

  50. [50]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  51. [51]

    C- pack: Packed resources for general chinese embeddings,

    S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J.-Y . Nie, “C- pack: Packed resources for general chinese embeddings,” inProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 641–649

  52. [52]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association f...

  53. [53]

    Visualizing data using t-sne

    L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”Journal of machine learning research, vol. 9, no. 11, 2008

  54. [54]

    Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,

    P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,”Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987

  55. [55]

    A test metric for assessing single-cell rna-seq batch correction,

    M. Büttner, Z. Miao, F. A. Wolf, S. A. Teichmann, and F. J. Theis, “A test metric for assessing single-cell rna-seq batch correction,”Nature methods, vol. 16, no. 1, pp. 43–49, 2019

  56. [56]

    Adaptation odyssey in llms: Why does additional pretraining sometimes fail to improve?

    F. Öncel, M. Bethge, B. Ermis, M. Ravanelli, C. Subakan, and Ç. Yıldız, “Adaptation odyssey in llms: Why does additional pretraining sometimes fail to improve?” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 19 834–19 843

  57. [57]

    Understanding multimodal llms under distribution shifts: An information-theoretic approach,

    C. Oh, Z. Fang, S. Im, X. Du, and Y . Li, “Understanding multimodal llms under distribution shifts: An information-theoretic approach,”arXiv preprint arXiv:2502.00577, 2025

  58. [58]

    Dawin: Training-free dynamic weight interpolation for robust adaptation,

    C. Oh, Y . Li, K. Song, S. Yun, and D. Han, “Dawin: Training-free dynamic weight interpolation for robust adaptation,”arXiv preprint arXiv:2410.03782, 2024

  59. [59]

    Spine: Token-selective test-time reinforcement learning with entropy-band regularization,

    J. Wu, Y. George, J. Ye, Y. Wu, D. F. Schmidt, and J. Cai, “Spine: Token-selective test-time reinforcement learning with entropy-band regularization,” arXiv preprint arXiv:2511.17938, 2025

  60. [60]

    Sharing matters: Analysing neurons across languages and tasks in llms,

    W. Wang, B. Haddow, M. Wu, W. Peng, and A. Birch, “Sharing matters: Analysing neurons across languages and tasks in llms,” arXiv preprint arXiv:2406.09265, 2024

  61. [61]

    How programming concepts and neurons are shared in code language models,

    A. H. Kargaran, Y. Liu, F. Yvon, and H. Schütze, “How programming concepts and neurons are shared in code language models,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 26905–26917

  62. [62]

    Calibrate before use: Improving few-shot performance of language models,

    Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12697–12706

  63. [63]

    Exploring explanations improves the robustness of in-context learning,

    U. Honda and T. Oka, “Exploring explanations improves the robustness of in-context learning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 23693–23714

  64. [64]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning,

    A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic dimensionality explains the effectiveness of language model fine-tuning,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7319–7328

  65. [65]

    Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,

    L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  66. [66]

    From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment,

    C. Huang, Y. Ye, B. Fu, Q. Su, and X. Shi, “From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 28956–28974

  67. [67]

    On relation-specific neurons in large language models,

    Y. Liu, R. Chen, L. Hirlimann, A. D. Hakimi, M. Wang, A. H. Kargaran, S. Rothe, F. Yvon, and H. Schütze, “On relation-specific neurons in large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 992–1022

  68. [68]

    Teaching small language models to reason,

    L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 1773–1781

  69. [69]

    Large language models are reasoning teachers,

    N. Ho, L. Schmid, and S.-Y. Yun, “Large language models are reasoning teachers,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 14852–14882

  70. [70]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,

    C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 8003–8017

  71. [71]

    Distilling the knowledge in a neural network,

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015

  72. [72]

    Knowledge distillation: A survey,

    J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021

  73. [73]

    Specializing smaller language models towards multi-step reasoning,

    Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializing smaller language models towards multi-step reasoning,” in International Conference on Machine Learning. PMLR, 2023, pp. 10421–10430

  74. [74]

    Orca: Progressive learning from complex explanation traces of GPT-4,

    S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of gpt-4,” arXiv preprint arXiv:2306.02707, 2023