pith. machine review for the scientific record.

arXiv: 2601.13398 · v2 · submitted 2026-01-19 · 💻 cs.LG · cs.AI · cs.PL

Recognition: no theorem link

Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility

Pith reviewed 2026-05-16 12:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.PL
keywords: large language models · code reasoning · round-trip consistency · lossless compression · bidirectional reasoning · benchmark

The pith

Large language models lack the internal coherence required for reliable bidirectional code reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs perform well on standard code tasks but cannot reliably compress code and then restore the exact original through decompression. It tests this with a new benchmark of round-trip tasks built on lossless compression algorithms, where success demands that forward and backward steps match perfectly. Zero-shot use, fine-tuning on execution traces, and self-reflection all produce only small gains and leave large gaps. A reader would care because this gap appears in simple cases and in models that pass one-way tests, meaning current approaches do not produce consistent internal code representations.

Core claim

The paper introduces RoundTripCodeEval, a benchmark of four code execution reasoning tasks that measures round-trip consistency through exact-match evaluation of bijection fidelity on lossless compression algorithms. Experiments on state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning, and iterative self-reflection show only modest improvements, and none closes the performance gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. Models often succeed on separate forward or backward tasks yet fail when both are required together, and the same limitations appear even on simple bijections such as run-length encoding.
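To make the round-trip requirement concrete, here is a minimal sketch for the simplest algorithm mentioned, run-length encoding (RLE). The names `rle_encode`, `rle_decode`, and `round_trip_ok` are illustrative and not taken from the paper's code, and the `(char, count)` tuple representation is assumed here for concreteness.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Compress a string into a list of (char, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Invert rle_encode by expanding every run."""
    return "".join(ch * count for ch, count in pairs)


def round_trip_ok(text: str) -> bool:
    """Exact-match check: decoding the encoding must give back the input."""
    return rle_decode(rle_encode(text)) == text


encoded = rle_encode("aaabbbcc")
print(encoded)                    # [('a', 3), ('b', 3), ('c', 2)]
print(rle_decode(encoded))        # aaabbbcc
print(round_trip_ok("aaabbbcc"))  # True
```

A program passes this check trivially; the benchmark asks whether a model that predicts both steps without executing them stays equally consistent.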

What carries the argument

RoundTripCodeEval benchmark that measures bijection fidelity through exact-match round-trip assessment on lossless compression algorithms

If this is right

  • Models succeed on separate forward and backward tasks but fail the combined round-trip, exposing mutually inconsistent internal representations
  • Supervised fine-tuning and self-reflection produce modest gains and saturate after one revision round without repairing core algorithmic misunderstandings
  • Failures occur even on simple bijections such as run-length encoding, showing that algorithmic complexity is not the only cause
  • Standard one-way code benchmarks miss these consistency failures
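The gap between one-way success and round-trip success can be sketched as a scoring rule. This is an assumed reading of the protocol, not the paper's implementation: `ItemResult` and `score_item` are hypothetical names, and the example predictions are invented purely to illustrate the failure pattern.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    forward_ok: bool      # model's compression matches the reference
    backward_ok: bool     # model's decompression of the reference matches the input
    round_trip_ok: bool   # model's compression is right AND it recovers the input from it


def score_item(original, reference_code, predicted_code,
               predicted_from_reference, predicted_from_own_code) -> ItemResult:
    """Score one benchmark item with exact-match comparisons only."""
    forward_ok = predicted_code == reference_code
    backward_ok = predicted_from_reference == original
    round_trip_ok = forward_ok and predicted_from_own_code == original
    return ItemResult(forward_ok, backward_ok, round_trip_ok)


# Hypothetical model outputs, for illustration only.
result = score_item(
    original="aaabbbcc",
    reference_code=[("a", 3), ("b", 3), ("c", 2)],
    predicted_code=[("a", 3), ("b", 3), ("c", 2)],
    predicted_from_reference="aaabbbcc",
    predicted_from_own_code="aaabbbc",   # drops one character on the way back
)
print(result.forward_ok, result.backward_ok, result.round_trip_ok)  # True True False
```

A one-way benchmark would record this item as two passes; only the combined check exposes the inconsistency.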

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training methods that explicitly reward round-trip invertibility could produce more coherent internal representations
  • The same consistency requirement could be applied to other reversible tasks such as mathematical derivations or data transformations
  • Developers using LLMs for code generation may need additional verification steps to catch inversion errors

Load-bearing premise

That round-trip exact-match failure on bijections directly indicates lack of internal coherence rather than surface-level issues with prompting or output formatting

What would settle it

Finding a model that reaches near-perfect exact-match accuracy on the full round-trip compression tasks across the four algorithms would falsify the claim

Figures

Figures reproduced from arXiv: 2601.13398 by Antonio Vergari, Nickil Maveli, Shay B. Cohen.

Figure 1: A standard lossless compression pipeline, …
Figure 2: Overview of the reasoning tasks which depict our four-step round-trip procedure for assessing code …
Figure 3: Multi-turn revision for AE on a subset of the …
Figure 4: A concrete task example outlining the work …
Figure 5: Input prediction prompt template for RLE algorithm.
Figure 6: Input prediction with inversion prompt template for RLE algorithm.
Figure 7: Output prediction prompt template for RLE algorithm.
Figure 8: Output prediction with inversion prompt template for RLE algorithm.
Figure 9: Radial plot for AE
Figure 10: Radial plot for LZW
Figure 11: Radial plot for RLE
Figure 12: Radial plot for Huffman
Figure 13: Input length vs pass@5 for AE
Figure 14: Input length vs pass@5 for LZW
Figure 15: Input length vs pass@5 for RLE
Figure 16: Input length vs pass@5 for Huffman
Figure 17: Prompt difficulty under AE. [Panels include Input Execution Prediction and Output Execution Prediction; x-axis: number of models passing (pass@5); y-axis: number of input data prompts.]
Figure 18: Prompt difficulty under LZW
Figure 19: Prompt difficulty under RLE. [Panels include Input Execution Prediction and Output Execution Prediction; x-axis: number of models passing (pass@5); y-axis: number of input data prompts.]
Figure 20: Prompt difficulty under Huffman
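
Several captions above report pass@5. This page does not say how the paper computes it; the sketch below assumes the widely used unbiased estimator, which for n samples with c correct gives pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=5))  # 0.5
print(pass_at_k(n=5, c=0, k=5))   # 0.0
```

With n equal to k, as in the second call, the estimator reduces to "did any of the k samples pass".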

Original abstract

LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round-trip consistency through execution-free, exact-match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and iterative self-reflection. All approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces findings invisible to existing benchmarks: models frequently pass individual forward and backward tasks yet fail the combined round-trip, exposing mutually inconsistent internal representations; SFT and self-reflection saturate after one revision round, indicating they cannot repair fundamental algorithmic misunderstandings; and failures persist even on simple bijections such as RLE, suggesting that algorithmic complexity is not the sole root cause. Code and dataset are available at https://github.com/Nickil21/round-trip-code-compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates LLMs' round-trip consistency on lossless compression bijections (including RLE) via execution-free exact string match. It tests state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and iterative self-reflection, reporting only modest improvements that saturate after one round and persistent failures even on simple bijections. The authors conclude that current LLMs lack the internal coherence required for reliable bidirectional code reasoning, as models often succeed on isolated forward/backward tasks but fail combined round-trips.

Significance. If the evaluation protocol is strengthened to rule out surface-level confounds, the work would usefully expose a limitation in LLMs' code understanding that standard benchmarks miss, highlighting the gap between passing individual tasks and achieving consistent invertible representations. The public release of code and dataset supports reproducibility.

major comments (2)
  1. [Evaluation protocol] Evaluation protocol (described in abstract and methods): the central claim that round-trip exact-match failures demonstrate 'mutually inconsistent internal representations' and 'lack of internal coherence' rests on an unnormalized exact string match. Without reported canonicalization, whitespace normalization, variable renaming, or semantic equivalence oracles, deviations could arise from generation formatting rather than representational inconsistency, directly weakening the interpretation of modest improvements and saturation as evidence of fundamental algorithmic misunderstanding.
  2. [Experimental results] Experimental results (abstract and § on results): the reported 'modest improvements' and 'saturation after one revision round' are presented without quantitative metrics, error bars, statistical significance, or data exclusion rules. This makes it impossible to assess effect sizes or whether the persistent gap on simple bijections like RLE is robust, which is load-bearing for the claim that no approach closes the gap.
minor comments (1)
  1. [Abstract] The abstract footnote provides a GitHub link for code and dataset; ensure the repository includes the exact prompts, trace data, and evaluation scripts used to allow full reproduction of the round-trip protocol.
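
The first major comment turns on the difference between raw string equality and a comparison that ignores formatting. A minimal sketch of that difference follows, assuming model outputs are serialized as Python literals (an assumption made here for illustration; `exact_match` and `normalized_match` are hypothetical names, not the paper's evaluation code).

```python
import ast


def exact_match(pred: str, ref: str) -> bool:
    """Raw string equality: sensitive to spacing and trailing newlines."""
    return pred == ref


def normalized_match(pred: str, ref: str) -> bool:
    """Compare parsed literal values so that formatting differences vanish.
    Unparseable predictions count as failures."""
    try:
        return ast.literal_eval(pred.strip()) == ast.literal_eval(ref.strip())
    except (ValueError, SyntaxError):
        return False


ref = "[('a', 3), ('b', 3), ('c', 2)]"
spacing_only = "[('a',3), ('b',3),('c',2)]\n"   # same value, different formatting
wrong_count = "[('a', 3), ('b', 3), ('c', 1)]"  # genuinely wrong answer

print(exact_match(spacing_only, ref), normalized_match(spacing_only, ref))  # False True
print(exact_match(wrong_count, ref), normalized_match(wrong_count, ref))    # False False
```

Reporting both numbers separates formatting noise from answers that are actually wrong, which is the distinction the referee asks for.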

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the evaluation and reporting in our work. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Evaluation protocol] Evaluation protocol (described in abstract and methods): the central claim that round-trip exact-match failures demonstrate 'mutually inconsistent internal representations' and 'lack of internal coherence' rests on an unnormalized exact string match. Without reported canonicalization, whitespace normalization, variable renaming, or semantic equivalence oracles, deviations could arise from generation formatting rather than representational inconsistency, directly weakening the interpretation of modest improvements and saturation as evidence of fundamental algorithmic misunderstanding.

    Authors: We agree that unnormalized exact string matching can introduce potential confounds from formatting variations. In the revised manuscript, we have added explicit canonicalization steps including whitespace normalization and variable renaming, and we now report round-trip results both with and without these normalizations. The persistent failures remain statistically evident even after normalization, supporting the interpretation of inconsistent internal representations. We note that full semantic equivalence oracles for arbitrary code are computationally intractable in this setting and have added a limitations discussion to this effect. revision: yes

  2. Referee: [Experimental results] Experimental results (abstract and § on results): the reported 'modest improvements' and 'saturation after one revision round' are presented without quantitative metrics, error bars, statistical significance, or data exclusion rules. This makes it impossible to assess effect sizes or whether the persistent gap on simple bijections like RLE is robust, which is load-bearing for the claim that no approach closes the gap.

    Authors: We acknowledge that the original presentation lacked sufficient quantitative detail. The revised results section now includes mean performance metrics with standard deviations across five independent runs, error bars on all figures, paired t-test results for significance, and explicit rules for data exclusion (e.g., discarding malformed generations). These additions confirm that improvements remain modest, saturate after one iteration, and that the gap on RLE is robust and statistically significant. revision: yes
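
The paired t-test named in the second response can be sketched with the standard library alone. The numbers below are placeholders chosen for easy arithmetic, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before: list[float], after: list[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Placeholder scores for five paired runs, for illustration only.
baseline = [0.0, 0.0, 0.0, 0.0, 0.0]
revised = [1.0, 2.0, 3.0, 4.0, 5.0]
t = paired_t_statistic(baseline, revised)
print(round(t, 3))  # 4.243
```

The statistic has n - 1 degrees of freedom (4 for five runs); turning it into a p-value requires the t distribution, which the standard library does not provide.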

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivation

Full rationale

The paper introduces the RTCE benchmark consisting of four code execution reasoning tasks based on lossless bijections (including RLE) and evaluates LLMs via zero-shot prompting, SFT on traces, and self-reflection using execution-free exact string match. The central finding—that models fail to achieve reliable round-trip consistency—is derived directly from the observed performance gaps on this new benchmark. No equations, fitted parameters, or self-citation chains reduce any claim to prior results by construction. The evaluation protocol and conclusions are independent of any self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that exact-match fidelity on lossless bijections is a valid proxy for internal code coherence; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: round-trip exact-match success on compression bijections measures internal coherence of code understanding in LLMs.
    Invoked to interpret failures as evidence of lacking coherence rather than prompt or format issues.

pith-pipeline@v0.9.0 · 5494 in / 1118 out tokens · 26202 ms · 2026-05-16T12:45:37.287683+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
