arxiv: 2601.13398 · v2 · submitted 2026-01-19 · 💻 cs.LG · cs.AI · cs.PL


Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility

Nickil Maveli , Antonio Vergari , Shay B. Cohen


Pith reviewed 2026-05-16 12:45 UTC · model grok-4.3


keywords large language models · code reasoning · round-trip consistency · lossless compression · bidirectional reasoning · benchmark


The pith

Large language models lack the internal coherence required for reliable bidirectional code reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs perform well on standard code tasks but cannot reliably compress code and then restore the exact original through decompression. It tests this with a new benchmark of round-trip tasks built on lossless compression algorithms, where success demands that forward and backward steps match perfectly. Zero-shot use, fine-tuning on execution traces, and self-reflection all produce only small gains and leave large gaps. A reader would care because this gap appears in simple cases and in models that pass one-way tests, meaning current approaches do not produce consistent internal code representations.

Core claim

The paper introduces RoundTripCodeEval, a benchmark of four code execution reasoning tasks that measures round-trip consistency through exact-match evaluation of bijection fidelity on lossless compression algorithms. Experiments on state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning, and iterative self-reflection show only modest improvements, and none closes the performance gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. Models often succeed on separate forward or backward tasks yet fail when both are required together, and the same limitations appear even on simple bijections such as run-length encoding.
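To make the round-trip idea concrete, here is a minimal sketch of run-length encoding and its inverse in Python. It is not the benchmark's reference implementation: the function names and the choice of a list of `(char, count)` tuples as the compressed form are assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(uncompressed: str) -> list[tuple[str, int]]:
    """Compress a string into (char, count) runs (illustrative format)."""
    return [(ch, len(list(run))) for ch, run in groupby(uncompressed)]


def rle_decode(compressed: list[tuple[str, int]]) -> str:
    """Invert rle_encode by expanding every run back into characters."""
    return "".join(ch * count for ch, count in compressed)


def is_round_trip_consistent(text: str) -> bool:
    """Exact-match check: decoding the encoding must return the input."""
    return rle_decode(rle_encode(text)) == text


encoded = rle_encode("aaabbbcc")
print(encoded)                               # [('a', 3), ('b', 3), ('c', 2)]
print(rle_decode(encoded))                   # aaabbbcc
print(is_round_trip_consistent("aaabbbcc"))  # True
```

A program passes this check trivially. The benchmark, as described, asks whether a model's predicted forward and backward steps agree in the same exact-match sense without executing the code.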

What carries the argument

RoundTripCodeEval benchmark that measures bijection fidelity through exact-match round-trip assessment on lossless compression algorithms

If this is right

Models succeed on separate forward and backward tasks but fail the combined round-trip, exposing mutually inconsistent internal representations
Supervised fine-tuning and self-reflection produce modest gains and saturate after one revision round without repairing core algorithmic misunderstandings
Failures occur even on simple bijections such as run-length encoding, showing that algorithmic complexity is not the only cause
Standard one-way code benchmarks miss these consistency failures
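The first point above can be illustrated with a small scoring sketch. It assumes a simplified definition in which a sample counts as a round-trip success only when both the forward and the backward prediction are exactly right; the paper's four-step procedure may differ. `SampleResult`, `summarize`, and the demo data are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SampleResult:
    forward_correct: bool   # compressed output predicted exactly
    backward_correct: bool  # original input recovered exactly


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(results: list[SampleResult]) -> dict[str, float]:
    """Report one-way accuracies next to the joint round-trip accuracy."""
    return {
        "forward": accuracy(r.forward_correct for r in results),
        "backward": accuracy(r.backward_correct for r in results),
        "round_trip": accuracy(
            r.forward_correct and r.backward_correct for r in results
        ),
    }


# Made-up outcomes for four samples, not data from the paper.
demo = [
    SampleResult(True, True),
    SampleResult(True, False),
    SampleResult(False, True),
    SampleResult(True, True),
]
print(summarize(demo))  # {'forward': 0.75, 'backward': 0.75, 'round_trip': 0.5}
```

Because the joint score requires both directions to succeed on the same sample, it can never exceed either one-way score, which is why a one-way benchmark can look healthy while round-trip consistency is poor.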

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training methods that explicitly reward round-trip invertibility could produce more coherent internal representations
The same consistency requirement could be applied to other reversible tasks such as mathematical derivations or data transformations
Developers using LLMs for code generation may need additional verification steps to catch inversion errors

Load-bearing premise

That round-trip exact-match failure on bijections directly indicates lack of internal coherence rather than surface-level issues with prompting or output formatting

What would settle it

Finding a model that reaches near-perfect exact-match accuracy on the full round-trip compression tasks across the four algorithms would falsify the claim

Figures

Figures reproduced from arXiv: 2601.13398 by Antonio Vergari, Nickil Maveli, Shay B. Cohen.

**Figure 2.** Overview of the reasoning tasks which depict our four-step round-trip procedure for assessing code …

**Figure 3.** Multi-turn revision for AE on a subset of the …

**Figure 4.** A concrete task example outlining the work …

**Figure 5.** Input prediction prompt template for RLE algorithm.

**Figure 6.** Input prediction with inversion prompt template for RLE algorithm.

**Figure 7.** Output prediction prompt template for RLE algorithm.

**Figure 8.** Output prediction with inversion prompt template for RLE algorithm.

**Figure 9.** Radial plot for AE

**Figure 10.** Radial plot for LZW

**Figure 11.** Radial plot for RLE

**Figure 12.** Radial plot for Huffman

**Figure 13.** Input length vs pass@5 for AE

**Figure 14.** Input length vs pass@5 for LZW

**Figure 15.** Input length vs pass@5 for RLE

**Figure 16.** Input length vs pass@5 for Huffman

**Figure 17.** Prompt difficulty under AE. Panels include Input Execution Prediction and Output Execution Prediction; axes: number of models passing (pass@5) and number of input data prompts.

**Figure 18.** Prompt difficulty under LZW

**Figure 19.** Prompt difficulty under RLE. Panels include Input Execution Prediction and Output Execution Prediction; axes: number of models passing (pass@5) and number of input data prompts.

**Figure 20.** Prompt difficulty under Huffman
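Several captions above report pass@5. The extracted text does not say how the paper computes it. One widely used choice is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that estimator; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Assumes the estimator 1 - C(n - c, k) / C(n, k); whether the paper
    uses exactly this form is not stated in the text above.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=5))  # 0.5
print(pass_at_k(n=5, c=0, k=5))   # 0.0
```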

Original abstract

LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round-trip consistency through execution-free, exact-match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and iterative self-reflection. All approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces findings invisible to existing benchmarks: models frequently pass individual forward and backward tasks yet fail the combined round-trip, exposing mutually inconsistent internal representations; SFT and self-reflection saturate after one revision round, indicating they cannot repair fundamental algorithmic misunderstandings; and failures persist even on simple bijections such as RLE, suggesting that algorithmic complexity is not the sole root cause. Code and dataset are available at https://github.com/Nickil21/round-trip-code-compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RTCE benchmark catches round-trip failures on code bijections that one-way tests miss, but exact-match without normalization may inflate the coherence problem.


The main takeaway is that this paper introduces RTCE, a benchmark that tests LLMs on round-trip invertibility for lossless compression algorithms such as RLE, and finds that zero-shot prompting, SFT on execution traces, and iterative self-reflection all leave large gaps. Models can handle compression or decompression separately but often fail when both directions must match exactly, and the improvements plateau quickly. That pattern is new and useful because it exposes consistency issues invisible to standard forward-only code benchmarks. The work is straightforward: the authors pick four bijections, run the models, and report that nothing closes the gap, with code and data released. They deserve credit for making the evaluation execution-free and for showing the saturation effect.

The soft spot is the jump to claiming absent internal coherence. The objection to the metric holds: without reported normalization or equivalence checking for whitespace, variable names, or equivalent constructs, exact string match can fail on outputs that are semantically correct. The abstract gives no numbers, error bars, or exclusion rules, so the size of the modest gains is hard to judge.

This is for researchers building or evaluating code LLMs who care about bidirectional reasoning. A reader working on training methods or benchmarks would get concrete ideas from the RTCE setup. It deserves peer review because the benchmark framing is fresh and the consistency failures are worth verifying with fuller methods details.

Referee Report

2 major / 1 minor

Summary. The paper introduces RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates LLMs' round-trip consistency on lossless compression bijections (including RLE) via execution-free exact string match. It tests state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and iterative self-reflection, reporting only modest improvements that saturate after one round and persistent failures even on simple bijections. The authors conclude that current LLMs lack the internal coherence required for reliable bidirectional code reasoning, as models often succeed on isolated forward/backward tasks but fail combined round-trips.

Significance. If the evaluation protocol is strengthened to rule out surface-level confounds, the work would usefully expose a limitation in LLMs' code understanding that standard benchmarks miss, highlighting the gap between passing individual tasks and achieving consistent invertible representations. The public release of code and dataset supports reproducibility.

major comments (2)

[Evaluation protocol] Evaluation protocol (described in abstract and methods): the central claim that round-trip exact-match failures demonstrate 'mutually inconsistent internal representations' and 'lack of internal coherence' rests on an unnormalized exact string match. Without reported canonicalization, whitespace normalization, variable renaming, or semantic equivalence oracles, deviations could arise from generation formatting rather than representational inconsistency, directly weakening the interpretation of modest improvements and saturation as evidence of fundamental algorithmic misunderstanding.
[Experimental results] Experimental results (abstract and § on results): the reported 'modest improvements' and 'saturation after one revision round' are presented without quantitative metrics, error bars, statistical significance, or data exclusion rules. This makes it impossible to assess effect sizes or whether the persistent gap on simple bijections like RLE is robust, which is load-bearing for the claim that no approach closes the gap.
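The first major comment turns on the difference between raw string equality and a comparison that tolerates formatting noise. The sketch below shows one way to separate the two. It assumes model answers are Python-literal-like strings (for example a list of tuples) and is not the paper's evaluation script; `exact_match` and `normalized_match` are hypothetical helpers.

```python
import ast


def exact_match(predicted: str, expected: str) -> bool:
    """Strict comparison: any character difference counts as a failure."""
    return predicted == expected


def normalized_match(predicted: str, expected: str) -> bool:
    """Compare parsed Python literals when both strings parse.

    Falls back to comparing whitespace-collapsed text otherwise.
    """
    try:
        return ast.literal_eval(predicted.strip()) == ast.literal_eval(
            expected.strip()
        )
    except (ValueError, SyntaxError, TypeError):
        return " ".join(predicted.split()) == " ".join(expected.split())


expected = "[('a', 3), ('b', 3), ('c', 2)]"
spacing_only = "[('a',3), ('b',3),('c',2)]"
wrong_count = "[('a', 2), ('b', 3), ('c', 2)]"

print(exact_match(spacing_only, expected))       # False
print(normalized_match(spacing_only, expected))  # True
print(normalized_match(wrong_count, expected))   # False
```

Reporting both scores side by side would show how much of a measured gap is formatting and how much is a wrong answer. Whitespace collapsing is only safe when whitespace is not part of the data being compressed, so the fallback branch is a judgment call.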

minor comments (1)

[Abstract] The abstract footnote provides a GitHub link for code and dataset; ensure the repository includes the exact prompts, trace data, and evaluation scripts used to allow full reproduction of the round-trip protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the evaluation and reporting in our work. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.


Referee: [Evaluation protocol] Evaluation protocol (described in abstract and methods): the central claim that round-trip exact-match failures demonstrate 'mutually inconsistent internal representations' and 'lack of internal coherence' rests on an unnormalized exact string match. Without reported canonicalization, whitespace normalization, variable renaming, or semantic equivalence oracles, deviations could arise from generation formatting rather than representational inconsistency, directly weakening the interpretation of modest improvements and saturation as evidence of fundamental algorithmic misunderstanding.

Authors: We agree that unnormalized exact string matching can introduce potential confounds from formatting variations. In the revised manuscript, we have added explicit canonicalization steps including whitespace normalization and variable renaming, and we now report round-trip results both with and without these normalizations. The persistent failures remain statistically evident even after normalization, supporting the interpretation of inconsistent internal representations. We note that full semantic equivalence oracles for arbitrary code are computationally intractable in this setting and have added a limitations discussion to this effect.
revision: yes

Referee: [Experimental results] Experimental results (abstract and § on results): the reported 'modest improvements' and 'saturation after one revision round' are presented without quantitative metrics, error bars, statistical significance, or data exclusion rules. This makes it impossible to assess effect sizes or whether the persistent gap on simple bijections like RLE is robust, which is load-bearing for the claim that no approach closes the gap.

Authors: We acknowledge that the original presentation lacked sufficient quantitative detail. The revised results section now includes mean performance metrics with standard deviations across five independent runs, error bars on all figures, paired t-test results for significance, and explicit rules for data exclusion (e.g., discarding malformed generations). These additions confirm that improvements remain modest, saturate after one iteration, and that the gap on RLE is robust and statistically significant.
revision: yes
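The kind of reporting described in this response (mean and standard deviation over runs, plus a paired t-test) can be sketched with the Python standard library. The scores below are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper that returns only the test statistic; a p-value would need a t-distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(before: list[float], after: list[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [a - b for a, b in zip(after, before)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder accuracies for five runs (illustrative only).
zero_shot = [0.30, 0.32, 0.31, 0.29, 0.33]
one_revision = [0.34, 0.35, 0.33, 0.34, 0.36]

print(f"zero-shot: {statistics.mean(zero_shot):.3f} "
      f"± {statistics.stdev(zero_shot):.3f}")
print(f"revised:   {statistics.mean(one_revision):.3f} "
      f"± {statistics.stdev(one_revision):.3f}")
print(f"paired t:  {paired_t_statistic(zero_shot, one_revision):.2f}")
```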

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivation


The paper introduces the RTCE benchmark consisting of four code execution reasoning tasks based on lossless bijections (including RLE) and evaluates LLMs via zero-shot prompting, SFT on traces, and self-reflection using execution-free exact string match. The central finding—that models fail to achieve reliable round-trip consistency—is derived directly from the observed performance gaps on this new benchmark. No equations, fitted parameters, or self-citation chains reduce any claim to prior results by construction. The evaluation protocol and conclusions are independent of any self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that exact-match fidelity on lossless bijections is a valid proxy for internal code coherence; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Round-trip exact-match success on compression bijections measures internal coherence of code understanding in LLMs.
Invoked to interpret failures as evidence of lacking coherence rather than prompt or format issues.

pith-pipeline@v0.9.0 · 5494 in / 1118 out tokens · 26202 ms · 2026-05-16T12:45:37.287683+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Corrupt Your Documents When You Delegate
cs.CL 2026-04 unverdicted novelty 6.0

LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
