Pith · machine review for the scientific record

arxiv: 2604.19488 · v1 · submitted 2026-04-21 · 💻 cs.AI

Recognition: unknown

CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords: cross-domain adaptation · chain-of-thought · large language models · logical reasoning · knowledge transfer · maximum mean discrepancy · in-context learning · domain shift

The pith

A lightweight adapter aligns latent reasoning patterns across domains in LLMs by distilling CoT-enriched representations and matching them with MMD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models perform well on logical reasoning when given good in-domain examples but often lack such examples in specialized or low-resource fields. Cross-domain samples can be retrieved as substitutes, yet domain shift usually blocks the model from extracting and using shared reasoning structures. CoDA addresses this by inserting a lightweight adapter that operates on intermediate hidden states rather than raw text prompts. The adapter distills chain-of-thought enriched reference features from the source domain and applies maximum mean discrepancy to bring the source and target latent distributions into closer alignment. Experiments across multiple reasoning tasks and model families show this yields large gains over prior cross-domain prompting baselines.
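The mechanism described above can be sketched in a few lines. Everything here is illustrative: the toy dimensions, the one-layer tanh adapter, and the 0.5 trade-off weight are assumptions for the sketch, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def adapter(h, W, b):
    """Lightweight non-linear adapter acting on hidden states (a sketch)."""
    return np.tanh(h @ W + b)

# Toy hidden states: source, CoT-enriched teacher, and shifted target (dim 8).
h_S = rng.normal(size=(32, 8))
h_star_S = h_S + 0.1 * rng.normal(size=(32, 8))  # stand-in for the CoT-infused teacher state
h_T = rng.normal(loc=0.5, size=(32, 8))          # domain shift as a mean offset

W = rng.normal(scale=0.1, size=(8, 8))
b = np.zeros(8)
z_S, z_T = adapter(h_S, W, b), adapter(h_T, W, b)

# Dual objective: MSE distillation toward the teacher state
# plus MMD alignment of source and target adapted representations.
mse_loss = ((z_S - h_star_S) ** 2).mean()
mmd_loss = gaussian_mmd2(z_S, z_T, bandwidth=1.0)
total_loss = mse_loss + 0.5 * mmd_loss  # 0.5 is a hypothetical trade-off weight
```

In a real training loop the adapter parameters would be optimized against this combined loss while the LLM stays frozen; the sketch only evaluates the objective once.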

Core claim

CoDA employs a lightweight adapter to directly intervene in the intermediate hidden states of LLMs. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy for kernelized distribution matching, the method aligns the latent reasoning representation of the source and target domains, allowing more effective transfer of cross-domain knowledge for logical reasoning.

What carries the argument

Lightweight adapter that intervenes in intermediate hidden states through CoT-enriched feature distillation plus MMD-based distribution matching to align source and target reasoning representations.

If this is right

  • Cross-domain samples become usable as surrogate in-context demonstrations for logical reasoning tasks.
  • Performance improves substantially in expertise-scarce domains such as certain scientific or legal fields.
  • The gains hold across different LLM families on multiple reasoning benchmarks.
  • The method outperforms previous state-of-the-art cross-domain baselines by a large margin.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Reasoning patterns may be partially separable from surface domain content at the level of hidden states.
  • Similar hidden-state adapters could be tested on non-reasoning transfer tasks such as generation or classification.
  • The approach raises the possibility that targeted interventions inside the model can substitute for ever-larger context windows or retrieval systems.

Load-bearing premise

Pronounced domain shift between source and target can be overcome by aligning latent reasoning patterns that exist in intermediate hidden states and can be captured via CoT distillation and MMD.

What would settle it

If applying the CoDA adapter produces no measurable reduction in domain discrepancy metrics or no accuracy gains over raw cross-domain prompting on held-out logical reasoning benchmarks, the alignment approach would be shown ineffective.
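As a toy illustration of that falsification test, a kernel two-sample MMD estimate should drop once representations are aligned. Here a simple mean-shift stands in for the adapter's effect; that stand-in, and all shapes and constants, are assumptions for the sketch, not CoDA itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = rng.normal(size=(64, 4))
tgt = rng.normal(loc=1.0, size=(64, 4))    # pronounced domain shift

before = gaussian_mmd2(src, tgt)
aligned = tgt - tgt.mean(0) + src.mean(0)  # mean-shift stand-in for alignment
after = gaussian_mmd2(src, aligned)
assert after < before  # the discrepancy metric must drop for the premise to hold
```

If an adapter produced no such reduction (and no accuracy gain), the alignment story would fail exactly this kind of check.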

Figures

Figures reproduced from arXiv: 2604.19488 by Buzhou Tang, Dongning Sun, Jianzhi Yan, Le Liu, Yang Xiang, Zhiming Li.

Figure 1
Figure 1: The overall framework of CoDA. (A) Adapted Representation Extraction: a frozen LLM extracts the original hidden states (h_S, h_T) and the CoT-infused teacher state (h*_S). A lightweight non-linear adapter then maps these states into a shared reasoning-enhanced space. (B) CoT-guided Domain Adaptation: the adapter is optimized with a dual objective: an MSE loss distills the reasoning trajectory by m…
Figure 3
Figure 3: t-SNE visualization of latent representations. Compared to the unaligned …
Figure 4
Figure 4: Parameter efficiency vs. OOD accuracy. Gray arrows highlight …
Figure 5
Figure 5: Layer-wise intervention performance. Accuracy of latent intervention …
read the original abstract

Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoDA, a cross-domain adaptation technique for LLMs on logical reasoning tasks. It introduces a lightweight adapter that intervenes directly on intermediate hidden states, combining feature-based distillation of Chain-of-Thought (CoT) enriched source representations with Maximum Mean Discrepancy (MMD) kernel matching to align latent reasoning distributions between source and target domains. The central claim is that this alignment enables robust transfer of underlying reasoning patterns, yielding large-margin gains over prior SOTA baselines across multiple tasks and model families in low-resource domains lacking in-domain exemplars.

Significance. If the alignment mechanism demonstrably transfers step-by-step logical structures rather than merely reducing surface-level distributional mismatch, the work could meaningfully extend in-context learning to expertise-scarce domains such as emerging biomedical or niche legal fields, moving beyond retrieval-based prompting.

major comments (2)
  1. [Abstract] Abstract: The assertion that CoDA 'significantly outperform[s] the previous state-of-the-art baselines by a large margin' supplies no quantitative metrics, baseline names, task-specific results, ablation controls, or statistical significance tests, making the central empirical claim impossible to evaluate from the summary alone and requiring full experimental tables for assessment.
  2. [Method] Method description (abstract and implied §3): The claim that MMD on CoT-distilled hidden states 'aligns the latent reasoning representation' rests on the untested assumption that shared reasoning structures exist and are capturable at the level of kernelized activation matching; without ablations isolating the CoT distillation component, error analysis on reasoning-step fidelity, or probing for logical transfer (vs. generic regularization), it remains unclear whether gains reflect reasoning alignment or superficial statistic matching.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'CoT-enriched reference representations' is used without an inline definition or citation to the specific Chain-of-Thought prompting protocol employed for the source references.
  2. [Abstract] Abstract: 'Large margin' is repeated without anchoring to concrete effect sizes or comparison tables that would appear in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, clarifying the manuscript's contributions and outlining targeted revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that CoDA 'significantly outperform[s] the previous state-of-the-art baselines by a large margin' supplies no quantitative metrics, baseline names, task-specific results, ablation controls, or statistical significance tests, making the central empirical claim impossible to evaluate from the summary alone and requiring full experimental tables for assessment.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version, we will incorporate specific metrics (e.g., average accuracy gains and baseline names across the main tasks) while referencing the full results tables and statistical tests in Section 4. This change will make the central claim directly evaluable from the abstract without exceeding length constraints. revision: yes

  2. Referee: [Method] Method description (abstract and implied §3): The claim that MMD on CoT-distilled hidden states 'aligns the latent reasoning representation' rests on the untested assumption that shared reasoning structures exist and are capturable at the level of kernelized activation matching; without ablations isolating the CoT distillation component, error analysis on reasoning-step fidelity, or probing for logical transfer (vs. generic regularization), it remains unclear whether gains reflect reasoning alignment or superficial statistic matching.

    Authors: The manuscript already contains ablations in Section 4.3 that isolate the CoT distillation component, demonstrating that its removal leads to substantial performance drops relative to the full CoDA model. Qualitative examples of transferred reasoning chains are also provided in the appendix. We acknowledge, however, that additional quantitative probing for step-level logical fidelity versus generic regularization would provide stronger evidence. We will add a new analysis subsection with error categorization on reasoning steps and a controlled comparison against a non-CoT MMD baseline. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture with no self-referential derivations or load-bearing self-citations

full rationale

The paper presents CoDA as a practical method combining a lightweight adapter, feature-based distillation of CoT-enriched representations, and MMD for distribution alignment. No equations, derivations, or parameter-fitting steps are described that reduce the claimed cross-domain transfer gains to inputs by construction. The abstract and method description treat the components as independent design choices validated empirically, without invoking uniqueness theorems, renaming known results, or self-citation chains for the core premise. This matches the reader's assessment of an empirical architecture lacking self-referential structure. The central claim remains open to external validation via experiments rather than being tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on learned adapter parameters and the assumption of transferable latent structures; no new physical entities or ungrounded constants are introduced.

free parameters (2)
  • adapter parameters
    Lightweight module weights learned to intervene in hidden states; fitted during training.
  • MMD kernel hyperparameters
    Bandwidth and kernel choice for distribution matching; selected to align source-target representations.
axioms (1)
  • domain assumption: source and target domains share latent reasoning patterns that can be aligned in hidden states
    Invoked to justify why MMD and CoT distillation will transfer knowledge effectively.
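For the bandwidth hyperparameter listed above, a common default in MMD practice is the median heuristic over pooled pairwise distances. The paper's actual selection rule is not stated in this summary, so the following is only a sketch of that conventional choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def median_heuristic_bandwidth(x, y):
    """Median pairwise distance over the pooled sample: a standard
    default for the Gaussian-kernel bandwidth in MMD."""
    z = np.concatenate([x, y], axis=0)
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

x = rng.normal(size=(20, 4))
y = rng.normal(loc=0.5, size=(20, 4))
bw = median_heuristic_bandwidth(x, y)
```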

pith-pipeline@v0.9.0 · 5576 in / 1167 out tokens · 34218 ms · 2026-05-10T02:07:22.281093+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Donget al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, vol. 1, no. 2, pp. 1–124, 2023

  2. [2]

    Mathprompter: Mathematical reasoning using large language models,

    S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical reasoning using large language models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), 2023, pp. 37–42

  3. [3]

    Towards reasoning in large language models: A survey,

    J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” inFindings of the association for computational linguistics: ACL 2023, 2023, pp. 1049–1065

  4. [4]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  5. [5–6]

    Towards robust deep reinforcement learning-based quantitative trading with neuro-symbolic trend analysis,

    J. Jiang, Z. Li, A. Cui, X. Zhou, B. Li, and D. Sun, “Towards robust deep reinforcement learning-based quantitative trading with neuro-symbolic trend analysis,” Neural Networks, vol. 201, p. 108924. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608026003850

  7. [7]

    Investigating training data detection in ai coders,

    T. Li, Y . Wei, Z. Li, A. Liu, Q. Guo, X. Liu, D. Sun, and Y . Liu, “Investigating training data detection in ai coders,” 2025. [Online]. Available: https://arxiv.org/abs/2507.17389

  8. [8]

    Logic-q: Improving deep reinforcement learning-based quantitative trading via program sketch-based tuning,

    Z. Li, J. Jiang, Y . Cao, A. Cui, B. Wu, B. Li, Y . Liu, and D. D. Sun, “Logic-q: Improving deep reinforcement learning-based quantitative trading via program sketch-based tuning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 17, pp. 18 584–18 592, Apr. 2025. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/ article/v...

  9. [9]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  10. [10]

    In-context learning with retrieved demonstrations for language models: A survey,

    M. Luo, X. Xu, Y . Liu, P. Pasupat, and M. Kazemi, “In-context learning with retrieved demonstrations for language models: A survey,”arXiv preprint arXiv:2401.11624, 2024

  11. [11]

    Learning to retrieve prompts for in-context learning,

    O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” inProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies, 2022, pp. 2655–2671

  12. [12]

    Learning to retrieve in-context examples for large language models,

    L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context examples for large language models,” inProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1752–1767

  13. [13]

    Uncertainty quantification for in-context learning of large language models,

    C. Ling, X. Zhao, X. Zhang, W. Cheng, Y . Liu, Y . Sun, M. Oishi, T. Osaki, K. Matsuda, J. Jiet al., “Uncertainty quantification for in-context learning of large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, p...

  14. [14]

    Reasoning graph enhanced exemplars retrieval for in-context learning,

    Y . Lin, B. Zhong, S. Jiang, J. Siebert, and Q. Chen, “Reasoning graph enhanced exemplars retrieval for in-context learning,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 9737–9759

  15. [15–16]

    Improving neural logic machines via failure reflection,

    Z. Li, Y. Cao, Y. Zheng, X. Liu, B. Wu, T. Li, X. Xu, J. Jiang, Y. S. Teo, S.-W. Lin, and Y. Liu, “Improving neural logic machines via failure reflection,” in Forty-first International Conference on Machine Learning. [Online]. Available: https://openreview.net/forum?id=JObct1zyTb

  17. [17]

    Fedmut: Generalized federated learning via stochastic mutation,

    M. Hu, Y . Cao, A. Li, Z. Li, C. Liu, T. Li, M. Chen, and Y . Liu, “Fedmut: Generalized federated learning via stochastic mutation,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, pp. 12 528–12 537, Mar. 2024. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/29146

  18. [18]

    Learning program semantics for vulnerability detection via vulnerability-specific inter-procedural slicing,

    B. Wu, S. Liu, Y . Xiao, Z. Li, J. Sun, and S.-W. Lin, “Learning program semantics for vulnerability detection via vulnerability-specific inter-procedural slicing,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023. New York, NY , USA: Association for C...

  19. [19]

    FAIRER: Fairness as decision rationale alignment,

    T. Li, Q. Guo, A. Liu, M. Du, Z. Li, and Y . Liu, “FAIRER: Fairness as decision rationale alignment,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 19 471–19 489. [On...

  20. [20]

    Reason Analogically via Cross-domain Prior Knowledge: An Empirical Study of Cross-domain Knowledge Transfer for In-Context Learning

    L. Liu, Z. Li, J. Yan, Z. Yuan, S. Chen, Y . Pan, B. Tang, Q. Chen, Y . Xiang, and D. D. Sun, “Reason analogically via cross-domain prior knowledge: An empirical study of cross-domain knowledge transfer for in-context learning,” 2026. [Online]. Available: https://arxiv.org/abs/2604.05396

  21. [21]

    The shape of reasoning: Topological analysis of reasoning traces in large language models

    X. W. Tan, N. Tan, G. Lee, and S. Kok, “The shape of reasoning: Topological analysis of reasoning traces in large language models,”arXiv preprint arXiv:2510.20665, 2025

  22. [22]

    Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval

    J. Yan, Z. Li, L. Liu, Z. Yuan, S. Chen, Y . Pan, B. Tang, Y . Xiang, and D. D. Sun, “Towards effective in-context cross-domain knowledge transfer via domain-invariant-neurons-based retrieval,” 2026. [Online]. Available: https://arxiv.org/abs/2604.05383

  23. [23]

    Large language models can be lazy learners: Analyze shortcuts in in-context learning,

    R. Tang, D. Kong, L. Huanget al., “Large language models can be lazy learners: Analyze shortcuts in in-context learning,” inFindings of the association for computational linguistics: ACL 2023, 2023, pp. 4645–4657

  24. [24]

    Exploring and mitigating shortcut learning for generative large language models,

    Z. Sun, Y . Xiao, J. Li, Y . Ji, W. Chen, and M. Zhang, “Exploring and mitigating shortcut learning for generative large language models,” in Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), 2024, pp. 6883–6893

  25. [25]

    Examining the robustness of llm evaluation to the distributional assumptions of benchmarks,

    C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of llm evaluation to the distributional assumptions of benchmarks,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 10 406–10 421

  26. [26]

    Graph of thoughts: Solving elaborate problems with large language models,

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyket al., “Graph of thoughts: Solving elaborate problems with large language models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690

  27. [27]

    Demystifying chains, trees, and graphs of thoughts,

    M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, G. Piao, N. Blach, P. Nyczyk, M. Copik, G. Kwa ´sniewski, J. Mülleret al., “Demystifying chains, trees, and graphs of thoughts,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  28. [28]

    Enhanced data synthesis for llm through reasoning structures generated by hierarchical gflownet,

    T. Bu, M. Zhang, H. Duan, S. Li, L. Hu, and Y . Li, “Enhanced data synthesis for llm through reasoning structures generated by hierarchical gflownet,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 15 931–15 958

  29. [29]

    Unlocking the power of llm uncertainty for active in-context example selection,

    H.-Y. Huang, Z. Wu, Y. Yang, J. Zhang, and Y. Wu, “Unlocking the power of llm uncertainty for active in-context example selection,” arXiv preprint arXiv:2408.09172, 2024

  30. [30]

    Using natural language explanations to improve robustness of in-context learning,

    X. He, Y . Wu, O.-M. Camburu, P. Minervini, and P. Stenetorp, “Using natural language explanations to improve robustness of in-context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13 477– 13 499

  31. [31]

    A kernel two-sample test,

    A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,”The journal of machine learning research, vol. 13, no. 1, pp. 723–773, 2012

  32. [32]

    A survey on transfer learning,

    S. J. Pan and Q. Yang, “A survey on transfer learning,”IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009

  33. [33]

    A survey of multi-source domain adaptation,

    S. Sun, H. Shi, and Y . Wu, “A survey of multi-source domain adaptation,” Information Fusion, vol. 24, pp. 84–92, 2015

  34. [34]

    A brief review of domain adaptation,

    A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, “A brief review of domain adaptation,” Advances in Data Science and Information Engineering: proceedings from ICDATA 2020 and IKE 2020, pp. 877–894, 2021

  35. [35]

    Cot vectors: Transferring and probing the reasoning mechanisms of llms,

    L. Li, Z. Wang, Y. Wu, J. Cai, and X. Yang, “Cot vectors: Transferring and probing the reasoning mechanisms of llms,” arXiv preprint arXiv:2510.00579, 2025

  36. [36]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakanoet al., “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021

  37. [37]

    Non-interactive symbolic-aided chain-of-thought for logical reasoning,

    P. M. Nguyen, T. Dang, and N. Inoue, “Non-interactive symbolic-aided chain-of-thought for logical reasoning,” inProceedings of the 39th Pacific Asia Conference on Language, Information and Computation, 2025, pp. 329–340

  38. [38]

    Folio: Natural language reasoning with first-order logic,

    S. Han, H. Schoelkopf, Y . Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y . Qiao, L. Bensonet al., “Folio: Natural language reasoning with first-order logic,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 22 017– 22 031

  39. [39]

    Proofwriter: Generating implications, proofs, and abductive statements over natural language,

    O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: Generating implications, proofs, and abductive statements over natural language,” inFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3621–3634

  40. [40]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge,

    A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4149–4158

  41. [41–42]

    Gemma 3 technical report,

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, ... [Online]. Available: https://arxiv.org/abs/2503.19786

  43. [43]

    Qwen2.5: A party of foundation models,

    Q. Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

  44. [44]

    The probabilistic relevance framework: BM25 and beyond

    S. Robertson and H. Zaragoza, The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009, vol. 4

  45. [45]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  46. [46]

    Revisiting demonstration selection strategies in in-context learning,

    K. Peng, L. Ding, Y . Yuan, X. Liu, M. Zhang, Y . Ouyang, and D. Tao, “Revisiting demonstration selection strategies in in-context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9090– 9101

  47. [47]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” Iclr, vol. 1, no. 2, p. 3, 2022

  48. [48]

    P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,

    X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 61–68

  49. [49]

    Steering Llama 2 via Contrastive Activation Addition

    N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner, “Steering llama 2 via contrastive activation addition,”arXiv preprint arXiv:2312.06681, 2023

  50. [50]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  51. [51]

    C- pack: Packed resources for general chinese embeddings,

    S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J.-Y . Nie, “C- pack: Packed resources for general chinese embeddings,” inProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 641–649

  52. [52]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association f...

  53. [53]

    Visualizing data using t-sne

    L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”Journal of machine learning research, vol. 9, no. 11, 2008

  54. [54]

    Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,

    P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,”Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987

  55. [55]

    A test metric for assessing single-cell rna-seq batch correction,

    M. Büttner, Z. Miao, F. A. Wolf, S. A. Teichmann, and F. J. Theis, “A test metric for assessing single-cell rna-seq batch correction,”Nature methods, vol. 16, no. 1, pp. 43–49, 2019

  56. [56]

    Adaptation odyssey in llms: Why does additional pretraining sometimes fail to improve?

    F. Öncel, M. Bethge, B. Ermis, M. Ravanelli, C. Subakan, and Ç. Yıldız, “Adaptation odyssey in llms: Why does additional pretraining sometimes fail to improve?” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 19 834–19 843

  57. [57]

    Understanding multimodal llms under distribution shifts: An information-theoretic approach,

    C. Oh, Z. Fang, S. Im, X. Du, and Y . Li, “Understanding multimodal llms under distribution shifts: An information-theoretic approach,”arXiv preprint arXiv:2502.00577, 2025

  58. [58]

    Dawin: Training-free dynamic weight interpolation for robust adaptation,

    C. Oh, Y . Li, K. Song, S. Yun, and D. Han, “Dawin: Training-free dynamic weight interpolation for robust adaptation,”arXiv preprint arXiv:2410.03782, 2024

  59. [59]

    Spine: Token-selective test-time reinforcement learning with entropy-band regularization,

    J. Wu, Y. George, J. Ye, Y. Wu, D. F. Schmidt, and J. Cai, “Spine: Token-selective test-time reinforcement learning with entropy-band regularization,” arXiv preprint arXiv:2511.17938, 2025

  60. [60]

    Sharing matters: Analysing neurons across languages and tasks in llms,

    W. Wang, B. Haddow, M. Wu, W. Peng, and A. Birch, “Sharing matters: Analysing neurons across languages and tasks in llms,” arXiv preprint arXiv:2406.09265, 2024

  61. [61]

    How programming concepts and neurons are shared in code language models,

    A. H. Kargaran, Y. Liu, F. Yvon, and H. Schütze, “How programming concepts and neurons are shared in code language models,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 26905–26917

  62. [62]

    Calibrate before use: Improving few-shot performance of language models,

    Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12697–12706

  63. [63]

    Exploring explanations improves the robustness of in-context learning,

    U. Honda and T. Oka, “Exploring explanations improves the robustness of in-context learning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 23693–23714

  64. [64]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning,

    A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic dimensionality explains the effectiveness of language model fine-tuning,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7319–7328

  65. [65]

    Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,

    L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  66. [66]

    From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment,

    C. Huang, Y. Ye, B. Fu, Q. Su, and X. Shi, “From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 28956–28974

  67. [67]

    On relation-specific neurons in large language models,

    Y. Liu, R. Chen, L. Hirlimann, A. D. Hakimi, M. Wang, A. H. Kargaran, S. Rothe, F. Yvon, and H. Schütze, “On relation-specific neurons in large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 992–1022

  68. [68]

    Teaching small language models to reason,

    L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 1773–1781

  69. [69]

    Large language models are reasoning teachers,

    N. Ho, L. Schmid, and S.-Y. Yun, “Large language models are reasoning teachers,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 14852–14882

  70. [70]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,

    C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 8003–8017

  71. [71]

    Distilling the knowledge in a neural network,

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015

  72. [72]

    Knowledge distillation: A survey,

    J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021

  73. [73]

    Specializing smaller language models towards multi-step reasoning,

    Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializing smaller language models towards multi-step reasoning,” in International Conference on Machine Learning. PMLR, 2023, pp. 10421–10430

  74. [74]

    Orca: Progressive learning from complex explanation traces of GPT-4,

    S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of gpt-4,” arXiv preprint arXiv:2306.02707, 2023