Recognition: unknown
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
Pith reviewed 2026-05-10 16:08 UTC · model grok-4.3
The pith
The teacher model's output entropy dynamically adjusts the token-level curriculum, the distillation temperature, and the choice of distillation branch to improve knowledge transfer to smaller student models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation: a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training, adjustment of the distillation temperature based on token entropy to better capture teacher confidence patterns, and a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens.
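The abstract names these three adjustments but gives no equations. As a reading aid only, a minimal PyTorch-style sketch of what such token-level adaptation could look like follows; the linear temperature map, the entropy-quantile branch switch, and the curriculum weighting are assumptions, not the authors' specification.

```python
# Illustrative sketch only: the abstract describes entropy-guided curriculum,
# per-token temperature, and a dual branch, but gives no formulas. The linear
# temperature map, quantile threshold, and curriculum weighting below are
# assumptions, not the authors' definitions. Assumes student_h has already
# been projected to the teacher's hidden size.
import torch
import torch.nn.functional as F

def entropy_guided_distill_loss(teacher_logits, student_logits,
                                teacher_h, student_h, progress,
                                t_min=1.0, t_max=4.0, feat_quantile=0.75):
    # Token-level entropy of the teacher's next-token distribution.
    p_teacher = F.softmax(teacher_logits, dim=-1)                      # [T, V]
    entropy = -(p_teacher * torch.log(p_teacher + 1e-12)).sum(-1)      # [T]
    e_norm = entropy / entropy.max().clamp_min(1e-12)                  # in [0, 1]

    # (1) Curriculum: at progress ~ 0 weight low-entropy tokens, at
    #     progress ~ 1 shift the weight toward high-entropy tokens.
    curriculum_w = (1.0 - progress) * (1.0 - e_norm) + progress * e_norm

    # (2) Per-token temperature: higher entropy -> larger (softer) temperature.
    tau = (t_min + (t_max - t_min) * e_norm).unsqueeze(-1)             # [T, 1]
    kd_per_token = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                            F.softmax(teacher_logits / tau, dim=-1),
                            reduction="none").sum(-1)                  # [T]

    # (3) Dual branch: add a feature-matching term only on "difficult"
    #     tokens, here those above an entropy quantile; easy tokens get
    #     the cheaper logits-only term.
    hard = (entropy >= torch.quantile(entropy, feat_quantile)).float() # [T]
    feat_per_token = (student_h - teacher_h).pow(2).mean(-1)           # [T]

    return (curriculum_w * (kd_per_token + hard * feat_per_token)).mean()
```

Here `progress` is the fraction of training completed; any schedule with the same monotone shift from low- to high-entropy tokens would fit the abstract's description equally well.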
What carries the argument
Teacher output entropy, which measures uncertainty in the next-token distribution and is used to adapt curriculum order, temperature scaling, and choice between logit-only and feature-based distillation branches for each token.
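The formula is not stated in the abstract; the standard Shannon entropy of the teacher's next-token distribution, which this description implies, would be

```latex
% Token-level teacher entropy at position t (assumed standard form).
H_t \;=\; -\sum_{v \in V} p_t(v)\,\log p_t(v),
\qquad p_t \;=\; \operatorname{softmax}(z_t),
```

where z_t are the teacher's logits and V the vocabulary; low H_t marks the "easy" tokens and high H_t the "difficult" ones in the adaptations above.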
If this is right
- Student models reach higher task performance for the same parameter count because training effort is concentrated on tokens where the teacher shows high uncertainty.
- Overall distillation training time decreases because easy low-entropy tokens use a cheaper logits-only branch instead of full feature extraction.
- Per-token temperature scaling lets the student imitate the teacher's varying confidence levels more closely than a single global temperature does.
- The curriculum ordering produces a natural progression from simple to complex tokens, similar to human learning schedules but derived automatically from entropy.
Where Pith is reading between the lines
- The same entropy signal could be reused to decide when to stop distilling a given token or to weight the loss dynamically beyond the three changes described.
- If entropy correlates with token difficulty across languages, the method might improve cross-lingual distillation without language-specific tuning.
- The dual-branch switch could be extended to other efficiency techniques such as early exiting or sparse attention on low-entropy tokens.
Load-bearing premise
The entropy of the teacher's predictions reliably marks tokens that are differentially important or difficult for the student, and the three adaptive changes together produce net gains without adding new training instabilities or biases.
What would settle it
The claim would be refuted if, on standard benchmarks such as GLUE or SuperGLUE, student models trained with the entropy-guided method showed no accuracy gain, or lower accuracy, compared with identical students trained with uniform distillation under the same compute budget.
read the original abstract
Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training. We further adjust the distillation temperature based on token entropy to better capture teacher confidence patterns. Moreover, we employ a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens. Extensive experiments validate the soundness and effectiveness of our method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EGAD, an entropy-guided adaptive distillation strategy for token-level knowledge transfer from large teacher LLMs to smaller student models. It dynamically adjusts the distillation process at the token level by using the teacher's output entropy to implement (1) a curriculum that shifts focus from low- to high-entropy tokens, (2) entropy-dependent temperature scaling, and (3) a dual-branch architecture applying logits-only distillation to easy tokens and deeper feature-based distillation to difficult tokens. The authors assert that this addresses the limitation of treating all tokens equally in prior methods and that extensive experiments validate its soundness and effectiveness.
Significance. If the empirical results hold and demonstrate consistent gains over standard distillation baselines, the approach could meaningfully advance efficient LLM deployment by making knowledge transfer adaptive to token uncertainty, potentially improving student performance with reduced computational overhead. The design is a coherent heuristic extension of existing curriculum and temperature techniques, directly targeting a known inefficiency in uniform token treatment.
major comments (1)
- Abstract: The central claim that 'extensive experiments validate the soundness and effectiveness of our method' is load-bearing, yet the manuscript provides no quantitative results, specific baselines, ablation studies, or statistical significance tests. Without these, the effectiveness of the entropy-guided adjustments cannot be assessed.
minor comments (1)
- The description of how entropy is computed and thresholded for the curriculum and dual-branch decisions would benefit from explicit equations and pseudocode to ensure reproducibility (a sketch of what this could look like follows).
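For concreteness, one hedged sketch of the kind of rule the comment asks for, assuming a sliding entropy band for the curriculum and a per-batch entropy quantile for the branch switch (neither rule is given in the manuscript):

```python
# Illustrative only: plausible thresholding rules, not the authors' method.
import torch

def curriculum_focus_mask(entropy, progress, band=0.5):
    """Tokens the curriculum focuses on at this training step.

    entropy:  [T] teacher entropies for the current batch.
    progress: scalar in [0, 1], fraction of training completed.
    band:     fraction of tokens in focus at any one time.
    """
    # Normalized entropy ranks in [0, 1] (0 = lowest, 1 = highest entropy).
    ranks = entropy.argsort().argsort().float() / max(len(entropy) - 1, 1)
    # Slide a window of width `band` from the low-entropy end (progress = 0)
    # to the high-entropy end (progress = 1).
    lo = progress * (1.0 - band)
    return (ranks >= lo) & (ranks <= lo + band)

def feature_branch_mask(entropy, feat_quantile=0.75):
    """Route tokens above an entropy quantile to the feature-based branch."""
    return entropy >= torch.quantile(entropy, feat_quantile)
```

Stating the actual rules at roughly this level of precision, together with the entropy definition, would address the reproducibility concern.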
Simulated Author's Rebuttal
Thank you for your review of our manuscript. We appreciate the detailed feedback and address the concern regarding the abstract below.
read point-by-point responses
-
Referee: Abstract: The central claim that 'extensive experiments validate the soundness and effectiveness of our method' is load-bearing, yet the manuscript provides no quantitative results, specific baselines, ablation studies, or statistical significance tests. Without these, the effectiveness of the entropy-guided adjustments cannot be assessed.
Authors: We agree that the abstract makes a strong claim about experimental validation that is not supported by any quantitative results, baselines, ablations, or statistical tests in the manuscript text provided. This is a substantive shortcoming, as the effectiveness of the proposed entropy-guided curriculum, temperature scaling, and dual-branch design cannot be evaluated without such evidence. We will revise the abstract to remove the phrase 'extensive experiments validate the soundness and effectiveness of our method' and replace it with a neutral description of the proposed approach. In the revised submission, we will either incorporate a concise summary of key results (if the full experimental section exists) or ensure the main body includes the required quantitative comparisons, ablations, and significance testing before resubmission.
Revision: yes
Circularity Check
No significant circularity; heuristic design with no self-referential reductions
full rationale
The paper proposes EGAD as a heuristic entropy-guided adaptive distillation method that introduces three interlocking adjustments (a token-level curriculum from low- to high-entropy tokens, entropy-based temperature scaling, and dual-branch logits-versus-feature distillation) to address unequal token importance. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or the described construction that would reduce any claimed result to its own inputs by definition. The method is presented as an empirical design choice to be validated by experiments, not a tautological or self-citation-forced outcome, so any claimed gains would rest on external benchmarks rather than on the method's own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Anonymous: Difficulty aware knowledge distillation (DA-KD) (2024), unpublished
- [2] Asai, A., Nguyen, H., Srinivasan, L., Clark, C.: Buffet: Benchmarking large language model fine-tuning across data domains. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024)
- [3] Ashkboos, S., Croci, M.L., Nascimento, M.G.d., Hoefler, T., Hensman, J.: SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024 (2024)
- [4] Ba, J., Caruana, R.: Do deep nets really need to be deep? Advances in Neural Information Processing Systems 27 (2014)
- [5] Cai, Y., Wang, Z., Li, Y., Wang, S., Liu, Z., Sun, M.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2302.06557 (2023)
- [6] Cheng, X., Rao, Z., Chen, Y., Zhang, Q.: Explaining knowledge distillation by quantifying the knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12925–12935 (2020)
- [7] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90% quality. arXiv preprint arXiv:2303.08774 (2023)
- [8] Cong, Z., Wang, Z., Zhang, H., Zheng, G., Cao, K., Zhao, L., Song, R., Li, J., Liu, C.: Hierarchical multi-scale feature fusion network for multi-center major depressive disorder classification with T1-weighted MRI. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE (2025)
- [9] Databricks: Databricks Dolly 15k. https://github.com/databricks-datasets/dolly-15k (2023)
- [10] Fu, R., Wang, Z., Meng, C., Lu, J., Wu, J., Qian, K., Zhang, H., Fong, S.: Missing-by-Design: Certifiable modality deletion for revocable multimodal sentiment analysis. arXiv preprint arXiv:2602.16144 (2026)
- [11] Gu, X., Sun, Q., Ma, H., Wang, B.: MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.03964 (2023)
- [12] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
- [13] He, Y., Wang, J., Wang, Y., Zhong, Y., Song, X., Lin, J., Yuan, X., Tang, J., Xin, Y., Zhang, H., et al.: Enhancing intent understanding for ambiguous prompt: A human-machine co-adaption strategy. arXiv preprint arXiv:2501.15167 (2025)
- [14] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
- [15] Honovich, O., Scialom, T., Levy, O., Ben-Ari, R.: Unnatural instructions: Tuning language models with multi-task instructions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7229–7249 (2023)
- [16] Hu, K., Zhang, W., Wang, T., Zhang, H., Wang, W., Long, H.: P2R-OBB: A unified framework for multi-scale and orientation-aware ship detection (2026)
- [17] Jiang, Y., Han, M., Li, M., Hou, X., Zhang, H., Zhu, W., Li, H., He, Y., Wu, G., Yang, D., et al.: Multi-agent diagnostic collaboration and segmentation-aware residual decoding for hallucination-resistant medical VQA. In: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11122–11126. IEEE (2026)
- [18] Jung, S., Yoon, S., Kim, D., Lee, H.: ToDi: Token-wise distillation via fine-grained divergence control. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 8089–8102 (2025)
- [19] Kang, Z., Gong, J., Chen, Q., Zhang, H., Liu, J., Fu, R., Feng, Z., Wang, Y., Fong, S., Zhou, K.: Multimodal multi-agent empowered legal judgment prediction. arXiv preprint arXiv:2601.12815 (2026)
- [20] Kwon, K., Na, H., Lee, H., Kim, N.S.: Adaptive knowledge distillation based on entropy. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7409–7413. IEEE (2020)
- [21] Li, Y., et al.: BiLD: Bidirectional logit distillation for large language models. arXiv preprint (2025)
- [22] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [23] Luo, Y., Wang, S., Liu, J., Xiao, J., Xue, R., Zhang, Z., Zhang, H., Lu, Y., Zhao, Y., Xie, Y.: PathoHR: Breast cancer survival prediction on high-resolution pathological images. arXiv preprint arXiv:2503.17970 (2025)
- [24] Mo, M., Tan, Y., Zhang, H., Zhang, H., He, Y.: ShieldedCode: Learning robust representations for virtual machine protected code. arXiv preprint arXiv:2601.20679 (2026)
- [25] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, S.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
- [26] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An open source machine learning framework. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
- [27] Qi, X., Zhang, Z., Gang, C., Zhang, H., Zhang, L., Zhang, Z., Zhao, Y.: MediAug: Exploring visual augmentation in medical imaging. In: Annual Conference on Medical Image Understanding and Analysis, pp. 218–232. Springer (2025)
- [28] Qi, X., Zhang, Z., Zheng, H., Chen, M., Kutaiba, N., Lim, R., Chiang, C., Tham, Z.E., Ren, X., Zhang, W., et al.: MedConv: Convolutions beat transformers on long-tailed bone density prediction. arXiv preprint arXiv:2502.00631 (2025)
- [29] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: Proceedings of the 5th Workshop on Representation Learning for NLP, pp. 1–7 (2019)
- [30] Su, W., et al.: EA-KD: Entropy-based adaptive knowledge distillation for large language models. arXiv preprint (2025)
- [31] Wang, B., Zhang, H., Cui, T., Wang, X., Song, J., Xu, H.: EvoRMD: Integrating biological context and evolutionary RNA language models for interpretable prediction of RNA modifications. bioRxiv preprint (2026)
- [32] Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al.: Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939 (2025)
- [33] Wang, S., Li, Y., Hu, B., Li, Z., Zhan, H., Li, L., Liu, W., Qian, R., Wu, G., Zhang, H., et al.: DeCo-DETR: Decoupled cognition DETR for efficient open-vocabulary object detection. arXiv preprint arXiv:2604.02753 (2026)
- [34] Wang, Y., Kordi, Y., Liu, S., Liu, Y., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2023)
- [35] Wanyan, Y., Yang, X., Chen, C., Xu, C.: Active exploration of multimodal complementarity for few-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6492–6502 (2023)
- [36] Wei, J.C., Lin, Y.C., Ritter-Gutierrez, F., Lee, H.y.: Multi-distillation from speech and music representation models. arXiv preprint arXiv:2506.07237 (2025)
- [37] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2020)
- [38] Wu, G., Zhang, H., Zhibin, Z., Guo, J., Cheng, X.: Iterative structured pruning for large language models with multi-domain calibration. arXiv preprint arXiv:2601.02674 (2026)
- [39] Wu, J., Fu, R., Li, C., Zhang, Z., Wu, G., Zhang, H., Lin, S., Ni, J., Li, Y., Zhang, D., et al.: ProtoFlow: Mitigating forgetting in class-incremental remote sensing segmentation via low-curvature prototype flow. arXiv preprint arXiv:2604.03212 (2026)
- [40] Wu, Y.H., Xiong, Y.J., Zhang, H., Zhang, J.C., Zhou, Z.: Sugar-coated poison: Benign generation unlocks LLM jailbreaking. arXiv preprint arXiv:2504.05652 (2025)
- [41] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)
- [42] Yu, W., Wei, S., Liu, J., Li, Y., Hu, M., Liu, A., Zhang, H., King, I.: Probability-entropy calibration: An elastic indicator for adaptive fine-tuning. arXiv preprint arXiv:2602.01745 (2026)
- [43] Zhang, H., Zhang, Z., Wu, G., Chen, H., Guo, J., Cheng, X.: MI-Prun: Optimize large language model pruning via mutual information. arXiv preprint arXiv:2601.07212 (2026)
- [44] Zhang, H., Hu, H., Shen, Y., Yu, W., Yuan, Y., You, H., Cheng, G., Zhang, Z., Gan, L., Wei, H., et al.: AsymoE: Leveraging modal asymmetry for enhanced expert specialization in large vision-language models. arXiv preprint arXiv:2509.12715 (2025)
- [45] Zhang, H., Yu, W., Gong, Y., Huang, W., Zhang, H., Huang, J.: Guiding efficient LLM instruction-tuning via gradient flow matching. In: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4981–4985. IEEE (2026)
- [46]
- [47] Zheng, H., Shi, Y., Gu, X., You, H., Zhang, Z., Gan, L., Zhang, H., Huang, W., Huang, J.: GraphGeo: Multi-agent debate framework for visual geo-localization with heterogeneous graph neural networks. arXiv preprint arXiv:2511.00908 (2025)
- [48] Zheng, H., You, H., Liu, Z., Zhang, Z., Gan, L., Zhang, H., Huang, W., Huang, J.: G2rammar: Bilingual grammar modeling for enhanced text-attributed graph learning. arXiv preprint arXiv:2511.00911 (2025)
- [49] Zheng, K., Yang, E.H.: Knowledge distillation based on transformed teacher matching. arXiv preprint arXiv:2402.11148 (2024)
- [50] Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al.: LIMA: Less is more for alignment. Advances in Neural Information Processing Systems 36, 55006–55021 (2023)
- [51] Zhou, W., Wu, G., Zhang, H.: HOT-P: Hierarchical optimal transport prototyping for self-supervised learning. In: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5301–5305. IEEE (2026)
- [52] Zhu, S., Shang, R., Yuan, B., Zhang, W., Li, W., Li, Y., Jiao, L.: DynamicKD: An effective knowledge distillation via dynamic entropy correction-based distillation for gap optimizing. Pattern Recognition 153, 110545 (2024)
- [53] Zu, L., Jin, Y., Cao, S., Suo, S., Lyu, H., Fu, S., Sun, H., Zhang, H.: End-to-end story visualization framework with penalty-based evaluation using vision-language models. In: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10492–10496. IEEE (2026)
discussion (0)