EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
Pith reviewed 2026-05-18 20:51 UTC · model grok-4.3
The pith
EyeMulator augments CodeLLM fine-tuning loss with token weights from human eye-tracking scan paths to mimic developer visual focus.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EyeMulator derives token-level attention weights directly from human eye-tracking scan paths collected during program comprehension and uses those weights to augment the loss function while fine-tuning CodeLLMs. The resulting models are induced to prioritize semantically salient tokens in the same manner as human developers. Experiments across StarCoder, Llama-3.2, and DeepSeek-Coder report gains exceeding 30 CodeBLEU points on translation and up to 22 BERTScore points on summarization, with ablations attributing the improvements to the human-attention component.
What carries the argument
Token-level attention weights extracted from eye-tracking scan paths, inserted as an additive term in the fine-tuning loss to bias the model toward human visual salience.
If this is right
- Code translation performance rises by more than 30 CodeBLEU points when the loss incorporates human attention weights.
- Code summarization improves by as much as 22 BERTScore points across the tested models.
- The same procedure works without architectural changes on StarCoder, Llama-3.2, and DeepSeek-Coder.
- Ablation results indicate that removing the human-attention term eliminates most of the observed gains.
Where Pith is reading between the lines
- The method could be tested on additional code tasks such as bug localization or test generation where human attention patterns may also highlight relevant regions.
- Aggregating eye-tracking data across many programmers might produce more stable attention weights than single-user recordings.
- If the attention weights prove consistent across languages, the approach could be applied to low-resource programming languages with limited training data.
Load-bearing premise
Human eye-tracking scan paths supply a reliable, task-general signal of semantic importance that can be transferred to new code examples and models without creating new error patterns.
What would settle it
Run identical fine-tuning on the same data with and without the eye-tracking-derived attention weights; if the two resulting models show no difference in CodeBLEU or BERTScore on held-out translation and summarization sets, the central claim does not hold.
Figures
read the original abstract
Code Language Models (CodeLLMs) traditionally learn attention based solely on statistical input-output token correlations ("machine attention"). In contrast, human developers rely on intuition, selectively fixating on semantically salient tokens during program comprehension. We present EyeMulator, a model-agnostic technique to align CodeLLM attention with human visual attention without architectural changes. By extracting scan paths from eye-tracking data, we derive token-level attention weights used to augment the loss function during fine-tuning. This induces the model to mimic human focus. Our evaluation across StarCoder, Llama-3.2, and DeepSeek-Coder shows that EyeMulator significantly outperforms baselines, achieving gains of over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization. Ablation studies confirm that these gains stem directly from replicating human attention dynamics. Artifacts are available at https://zenodo.org/records/17205682.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EyeMulator, a model-agnostic technique for improving CodeLLMs by aligning model attention with human visual attention extracted from eye-tracking scan paths. Token-level attention weights derived from these paths are used to augment the loss function during fine-tuning of models including StarCoder, Llama-3.2, and DeepSeek-Coder. The authors report large gains (over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization) and claim via ablation studies that improvements stem directly from replicating human attention dynamics. Artifacts are released on Zenodo.
Significance. If the central claims hold after clarification of the alignment procedure, the work would offer a practical, architecture-preserving way to inject human cognitive priors into code model training. The public artifacts are a positive step toward reproducibility. The approach could influence future efforts to ground LLM training in human comprehension signals, though its dependence on external eye-tracking data limits immediate scalability.
major comments (1)
- [Method section] Method section: the procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.
minor comments (2)
- [Abstract and §4] Abstract and evaluation sections: more explicit description of the eye-tracking dataset collection protocol, exact loss augmentation formula, baseline definitions, and statistical significance tests would strengthen verifiability of the reported numeric gains.
- [Evaluation] The manuscript should clarify whether the eye-tracking data is task-specific to the downstream translation/summarization benchmarks or drawn from a separate comprehension corpus.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our submission. The major comment highlights an important area for clarification in the Method section, and we have revised the manuscript to provide the requested details while preserving the core claims.
read point-by-point responses
-
Referee: [Method section] Method section: the procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.
Authors: We agree that the original description of the fixation-to-weight mapping was insufficiently detailed and could invite the interpretation raised by the referee. In the revised manuscript we have expanded the Method section with a dedicated subsection and pseudocode that specifies the following rules: (1) fixations falling on whitespace or inter-token spaces are attributed to the nearest preceding token by character offset within the line; (2) fixations landing inside comments are retained because they frequently mark regions of active comprehension; (3) raw fixation duration is used directly as the weight contribution without normalization by token length or visual prominence, as our internal validation showed stronger alignment with human-reported salience when duration is left unadjusted. We have also added a figure that illustrates the mapping on a sample code snippet. Regarding the ablation claim, we note that the reported gains are measured against both uniform re-weighting and randomly permuted human weights; the fact that only the original human-derived ordering produces the observed improvements supports that the benefit is tied to the specific attention dynamics captured by the eye-tracking data rather than generic re-weighting. We acknowledge that layout artifacts remain a possible confounding factor and have added a short limitations paragraph discussing this point. revision: yes
Circularity Check
No circularity: external eye-tracking data grounds the attention weights independently of model outputs or self-citations
full rationale
The paper's core derivation extracts token-level attention weights directly from external eye-tracking scan paths collected from human developers, then uses these weights to augment the fine-tuning loss. This chain does not reduce any claimed prediction or performance gain to a fitted parameter inside the paper's own equations, nor does it rely on self-citations for uniqueness or ansatz. Ablation studies compare against baselines using the same external signal, and results are reported on held-out tasks (translation, summarization) without tautological re-derivation of the input weights. The method is therefore self-contained against external benchmarks rather than internally forced.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human eye-tracking scan paths yield token-level attention weights that reflect semantic salience in code.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We extract scan paths from eye-tracking data, which reflects the order in which tokens are read by humans. We use these scan paths to assign attention weights to each token in the input samples used to train an LLM.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
w_j = w_base + 1/log(freq(g_j)+2) + E[θ_sj]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Silvia Abrahão, John Grundy, Mauro Pezzè, Margaret-Anne Storey, and Damian A. Tamburri. 2025. Software Engineering by and for Humans in an AI Era. ACM Transactions on Software Engineering and Methodology 34, 5 (June 2025), 1–46. https://doi.org/10.1145/3715111
-
[2]
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, et al. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. https: //doi.org/10.48550/arXiv.2204.05862
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862 2022
-
[3]
Aakash Bansal, Bonita Sharif, and Collin McMillan. 2023. Towards modeling human attention from eye movements for neural source code summarization. Proceedings of the ACM on Human-Computer Interaction 7, ETRA (2023), 1–19
work page 2023
-
[4]
Aakash Bansal, Chia-Yi Su, Zachary Karas, Yifan Zhang, Yu Huang, Toby Jia- Jun Li, and Collin McMillan. 2023. Modeling programmer attention as scanpath prediction. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1732–1736
work page 2023
-
[5]
I. Bertram, J. Hong, Y. Huang, W. Weimer, and Z. Sharafi. 2020. Trustworthi- ness perceptions in code review: An eye-tracking study. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, 31. https://doi.org/10.1145/3382494.3422164
-
[6]
Tara Capel and Margot Brereton. 2023. What is human-centered about human- centered AI? A map of the research landscape. In Proceedings of the 2023 CHI conference on human factors in computing systems . 1–23
work page 2023
-
[7]
Deep reinforcement learning from human preferences
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep Reinforcement Learning from Human Preferences. https: //doi.org/10.48550/arXiv.1706.03741
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03741 2023
-
[8]
Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of llm-generated code on leetcode. In Proceedings of the 28th international conference on evaluation and assessment in software engineering . 79–89
work page 2024
-
[9]
Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray
Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray. 2024. SemCoder: Training Code Language Models with Compre- hensive Semantics Reasoning. arXiv:2406.01006 [cs.CL] https://arxiv.org/abs/ 2406.01006
-
[10]
Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, and Tao Gui. 2024. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. https://doi.org/10.48550/arXiv.2402.01391
-
[11]
Aryaz Eghbali and Michael Pradel. 2022. CrystalBLEU: precisely and efficiently measuring the similarity of code. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–12
work page 2022
-
[12]
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model Alignment as Prospect Theoretic Optimization. https: //doi.org/10.48550/arXiv.2402.01306
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.01306 2024
-
[13]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[14]
Lisa Grabinger, Florian Hauser, Christian Wolff, and Jürgen Mottok. 2024. On eye tracking in software engineering. SN Computer Science 5, 6 (July 2024), 729. https://doi.org/10.1007/s42979-024-03045-3
-
[15]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Kadian, et al. 2024. The Llama 3 Herd of Models. https://doi.org/10.48550/arXiv.2407. 21783
-
[16]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guant- ing Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang
-
[17]
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-coder: When the Large Language Model Meets Programming – the Rise of Code Intelligence. https://doi.org/10.48550/arXiv.2401.14196
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196
-
[18]
Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo, et al . 2024. Retrieval-augmented code generation for universal information extraction. In CCF International Conference on Natural Language Processing and Chinese Computing . Springer, 30–42
work page 2024
-
[19]
E. Harth and P. Dugerdil. 2017. Program understanding models: An historical overview and a classification. In Proceedings of the 12th International Conference on Software Technologies (ICSOFT), Vol. 1. SciTePress, 402–413. https://doi.org/ 10.5220/0006465504020413
- [20]
- [21]
- [22]
-
[23]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106. 09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[24]
Y. Huang, K. Leach, Z. Sharafi, T. Santander, and W. Weimer. 2020. Biases and differences in code review using medical imaging and eye-tracking: genders, humans, and machines. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 456–468. https://doi.org/10.1...
-
[25]
Dominik Huber, Matteo Paltenghi, and Michael Pradel. 2023. Where to Look When Repairing Code? Comparing the Attention of Neural Models and Develop- ers. https://doi.org/10.48550/arXiv.2305.07287
-
[26]
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. https://doi.org/10. 48550/arXiv.2406.00515
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, and Min Zhang. 2024. A Survey on Human Preference Learning for Large Language Models. https://doi.org/10.48550/arXiv.2406.11191
- [28]
-
[29]
Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A Tale of Two Comprehensions? Analyzing Student Programmer Attention during Code Summarization.ACM Transactions on Software Engineering and Methodology (2024)
work page 2024
-
[30]
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452 (2023)
work page internal anchor Pith review arXiv 2023
-
[31]
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Akiki, et al. 2023. StarCoder: May the Source Be with You! https://doi.org/10.48550/arXiv.2305.06161
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.06161 2023
-
[32]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. https://doi.org/10.48550/arXiv.2305. 01210
-
[33]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambro- sio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al . 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Meta AI. 2024. Llama 3.2: Revolutionizing edge AI and vision with open, cus- tomizable models. Meta AI Blog. https://ai.meta.com/blog/llama-3-2-connect- 2024-vision-edge-mobile-devices/
work page 2024
-
[36]
Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–13
work page 2024
-
[37]
Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, and Son Nguyen. 2024. An Empirical Study on Capability of Large Language Models in Understanding Code Semantics. https://doi.org/10.48550/arXiv.2407.03611
-
[38]
M. P. O’Brien. 2003. Software comprehension: A review and research direction . Technical Report. Department of Computer Science & Information Systems, University of Limerick
work page 2003
-
[39]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Train- ing Language Models to Follow Instructions with Human F...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
-
[40]
Matteo Paltenghi and Michael Pradel. 2021. Thinking like a Developer? Compar- ing the Attention of Humans with Neural Models of Code. In2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, Mel- bourne, Australia, 867–879. https://doi.org/10.1109/ase51524.2021.9678712
- [41]
-
[42]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. https://doi.org/10.48550/arXiv.2305.18290
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.18290 2024
-
[43]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundare- san, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[44]
Pedro Rodeghero, Chao Liu, Peter W. McBurney, and Collin McMillan. 2014. Improving automated source code summarization via an eye-tracking study of programmers. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). ACM, 390–401. https://doi.org/10.1145/2568225.2568247 Conference’17, July 2017, Washington, DC, USA Yifan Zh...
-
[45]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
-
[46]
Proximal Policy Optimization Algorithms
Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv. 1707.06347
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[47]
Carsten Schulte, Tony Clear, Ahmad Taherkhani, Teresa Busjahn, and James H. Paterson. 2010. An introduction to program comprehension for computer science educators. In Proceedings of the 2010 ITiCSE Working Group Reports (ITiCSE -WGR ’10), Alison Clear and Lori Russell Dag (Eds.). ACM, 65–86. https://doi.org/10. 1145/1971681.1971687
-
[48]
C. Schulte, T. Clear, A. Taherkhani, T. Busjahn, and J. H. Paterson. 2010. An introduction to program comprehension for computer science educators. In Proceedings of the 2010 ITiCSE Working Group Reports (ITiCSE–WGR ’10), A. Clear and L. R. Dag (Eds.). ACM, 65–86
work page 2010
-
[49]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeek- Math: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL] https://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Z. Sharafi, Y. Huang, K. Leach, and W. Weimer. 2021. Toward an objective measure of developers’ cognitive activities. ACM Transactions on Software Engineering and Methodology 30, 3 (2021), 30:1–30:30
work page 2021
-
[51]
Zohreh Sharafi, Timothy Shaffer, Bonita Sharif, and Yann-Gaël Guéhéneuc. 2015. Eye-tracking metrics in software engineering. In 2015 Asia-Pacific Software Engi- neering Conference (APSEC). 96–103. https://doi.org/10.1109/APSEC.2015.53
-
[52]
Z. Sharafi, B. Sharif, Y.-G. Guéhéneuc, A. Begel, R. Bednarik, and M. Crosby. 2020. A practical guide on conducting eye-tracking studies in software engineering. Empirical Software Engineering 25, 5 (2020), 3128–3174. https://doi.org/10.1007/ s10664-020-09829-4
work page 2020
-
[53]
Zohreh Sharafi, Bonita Sharif, Yann-Gaël Guéhéneuc, Andrew Begel, Roman Bednarik, and Martha Crosby. 2020. A practical guide on conducting eye tracking studies in software engineering. Empirical Software Engineering 25, 5 (Sept. 2020), 3128–3174. https://doi.org/10.1007/s10664-020-09829-4
-
[54]
Zohreh Sharafi, Zéphyrin Soh, and Yann-Gaël Guéhéneuc. 2015. A systematic literature review on the usage of eye-tracking in software engineering. Informa- tion and Software Technology 67 (Nov. 2015), 79–107. https://doi.org/10.1016/j. infsof.2015.06.008
work page doi:10.1016/j 2015
-
[55]
Bonita Sharif, Mark Falcone, and Jonathan I. Maletic. 2012. An eye -tracking study on the role of scan time in finding source code defects. In Proceedings of the Symposium on Eye Tracking Research and Applications . ACM, 381–384. https://doi.org/10.1145/2168556.2168642
-
[56]
Bonita Sharif and Huzefa Kagdi. 2011. On the use of eye tracking in software traceability. In Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering (TEFSE ’11) . ACM, 67–70. https: //doi.org/10.1145/1987856.1987872
-
[57]
Bonita Sharif, Jeff Meinken, Timothy Shaffer, and Huzefa Kagdi. 2017. Eye movements in software traceability link recovery.Empirical Software Engineering 22, 3 (2017), 1063–1102. https://doi.org/10.1007/s10664-016-9486-9
-
[58]
Mohammed Latif Siddiq, Lindsay Roney, Jiahao Zhang, and Joanna Cecilia Da Silva Santos. 2024. Quality Assessment of ChatGPT Generated Code and Their Use by Developers. In Proceedings of the 21st International Confer- ence on Mining Software Repositories . ACM, Lisbon Portugal, 152–156. https: //doi.org/10.1145/3643991.3645071
-
[59]
Stewart Slocum, Asher Parker-Sartori, and Dylan Hadfield-Menell. 2025. Diverse preference learning for capabilities and alignment. InThe Thirteenth International Conference on Learning Representations
work page 2025
-
[60]
E. D. Tempero and Y.-C. Tu. 2024. Using program comprehension models to teach comprehensibility. In Proceedings of the ACE 2024: Australian Computing Education Conference. ACM, 1–10. https://doi.org/10.1145/3636243.3636244
-
[61]
Usman Ahmad Usmani, Ari Happonen, and Junzo Watada. 2023. Human-centered artificial intelligence: Designing for user empowerment and ethical considera- tions. In 2023 5th international congress on human-computer interaction, optimiza- tion and robotic applications (HORA) . IEEE, 1–7
work page 2023
-
[62]
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. 2024. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 8228–8238
work page 2024
-
[63]
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning
work page 2024
-
[64]
Yuanhao Wang, Qinghua Liu, and Chi Jin. 2023. Is rlhf more difficult than standard rl? a theoretical perspective. Advances in Neural Information Processing Systems 36 (2023), 76006–76032
work page 2023
-
[65]
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [66]
- [67]
-
[68]
Wei Xu, Marvin J Dainoff, Liezhong Ge, and Zaifeng Gao. 2023. Transitioning to human interaction with AI systems: New challenges and opportunities for HCI professionals to enable human-centered AI. International Journal of Human– Computer Interaction 39, 3 (2023), 494–518
work page 2023
-
[69]
Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhong- tao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2024. LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback. https://doi.org/10.48550/arXiv.2311.09336
-
[70]
Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. 2025. An empirical study of retrieval-augmented code generation: Challenges and opportunities. ACM Transactions on Software Engineering and Methodology (2025)
work page 2025
-
[71]
Yifan Zhang, Jiliang Li, Zachary Karas, Aakash Bansal, Toby Jia-Jun Li, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Eyetrans: Merging human and machine attention for neural code summarization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 115–136
work page 2024
-
[72]
Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. 2025. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation. https://doi.org/10.48550/arXiv.2409.20550
-
[73]
Li Zhong and Zilong Wang. 2024. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the AAAI conference on artificial intelligence , Vol. 38. 21841–21849
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.