SemChunk-C: Semantic Segmentation for C Code

Boris Nazarov; Darya Frolova; Pavel Kisilev; Shaked Leibzirer

arxiv: 2606.23697 · v1 · pith:AE52SANPnew · submitted 2026-05-12 · 💻 cs.SE · cs.AI · cs.PL

Pith reviewed 2026-06-30 22:34 UTC · model grok-4.3

classification: 💻 cs.SE · cs.AI · cs.PL

keywords: semantic chunking · C code segmentation · code boundary detection · functional code categories · lightweight language models · LLM classifiers for code · semantic coherence

The pith

SemChunk-C lightweight models match larger LLMs at identifying semantic chunks in C code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains small encoder models to divide C-family source files into flexible, semantically coherent segments and to label each segment with a functional category. It defines a set of such categories and uses an LLM-based approach to learn boundary detection that adapts to each file's structure rather than relying on fixed windows or rigid syntax rules. The resulting models are tested on multiple datasets and shown to reach boundary accuracy and coherence levels that match or exceed those of chunkers built from much larger code-focused language models. The work also reports gains when the chunks feed into downstream tasks such as retrieval.

Core claim

SemChunk-C is a family of lightweight language models with 17M to 150M parameters that identify chunk boundaries in C-related code and assign functional categories, achieving high boundary accuracy and semantic coherence on various datasets while matching or outperforming chunkers based on much larger code-oriented LLMs.

What carries the argument

An LLM-based classifier trained to detect flexible chunk boundaries and assign functional attributes such as data structures or interface blocks.
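To make the classifier's output concrete, here is a minimal sketch of how boundary and category predictions might be turned into chunks. It assumes one prediction per source line, in the form (starts a new chunk, category). The paper's models may well operate on tokens instead, and the category names below are hypothetical placeholders rather than the paper's label set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    start: int      # index of the first line in the chunk (0-based, inclusive)
    end: int        # index of the last line in the chunk (inclusive)
    category: str   # functional label; the names used below are hypothetical


def labels_to_chunks(labels: List[Tuple[bool, str]]) -> List[Chunk]:
    """Group per-line (starts_chunk, category) predictions into chunks.

    Assumption: a chunk takes the category predicted on its first line.
    """
    chunks: List[Chunk] = []
    for i, (starts, category) in enumerate(labels):
        if starts or not chunks:
            # Open a new chunk (also covers a first line not flagged as a start).
            chunks.append(Chunk(i, i, category))
        else:
            # Extend the current chunk to include this line.
            chunks[-1].end = i
    return chunks


# Hypothetical predictions for a six-line file.
example = [
    (True, "include_block"),
    (True, "data_structure"),
    (False, "data_structure"),
    (False, "data_structure"),
    (True, "function_definition"),
    (False, "function_definition"),
]
chunks = labels_to_chunks(example)
for c in chunks:
    print(c.start, c.end, c.category)
```

Because boundaries come from predictions rather than a fixed stride, chunk lengths vary with the file, which is the sense in which the boundaries are flexible.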

If this is right

The models handle challenging real-world constructs including nested definitions and macros across .c, .cpp, .h, and .cs files.
Downstream tasks such as retrieval show improved performance on curated benchmarks when using the produced chunks.
Small parameter counts allow efficient semantic segmentation without the compute cost of larger models.
The approach remains robust on irregular structural patterns typical of production C-family codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the category labels prove stable, they could support type-aware indexing that lets retrieval systems fetch code by functional role rather than text similarity alone.
The flexible-boundary method may reduce the need for language-specific parsers when moving the technique to other procedural languages.
Lightweight deployment could enable on-device or low-resource code analysis pipelines that currently depend on remote large-model calls.
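The type-aware indexing idea can be sketched as a two-step lookup: filter chunks by functional category, then rank the survivors by similarity to the query. This is an illustration only. The `CategoryIndex` class and the category names are hypothetical, and Jaccard overlap of identifiers stands in for whatever embedding similarity a real retrieval system would use.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased identifier-like tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))


class CategoryIndex:
    """Hypothetical index that stores chunks under their functional category."""

    def __init__(self) -> None:
        self._by_category: Dict[str, List[str]] = defaultdict(list)

    def add(self, category: str, code: str) -> None:
        self._by_category[category].append(code)

    def search(self, query: str, category: str, top_k: int = 1) -> List[str]:
        q = tokens(query)

        def score(code: str) -> float:
            t = tokens(code)
            union = q | t
            return len(q & t) / len(union) if union else 0.0

        # Only chunks with the requested category are considered at all.
        candidates = self._by_category.get(category, [])
        return sorted(candidates, key=score, reverse=True)[:top_k]


index = CategoryIndex()
index.add("data_structure", "typedef struct { int id; int count; } Inventory;")
index.add("function_definition",
          "int inventory_count(const Inventory *inv) { return inv->count; }")
index.add("data_structure", "typedef struct { float x; float y; } Point;")

hits = index.search("inventory count", category="data_structure")
print(hits)
```

The function chunk also mentions `inventory` and `count`, but it is never scored because the query asks for a data structure. That is the behaviour text similarity alone cannot provide.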

Load-bearing premise

An LLM-based classifier can reliably identify flexible chunk boundaries and assign meaningful functional categories without post-hoc tuning or dataset-specific adjustments.

What would settle it

Evaluation on a fresh collection of C files containing complex nested macros and irregular structures where boundary accuracy falls below that of larger LLM chunkers.
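One way such a comparison could be scored is with boundary precision, recall, and F1 against reference boundaries. The sketch below assumes exact-match scoring on line indices. This page does not define the paper's boundary accuracy metric, so this is a common choice rather than the authors' definition, and the numbers are invented for illustration.

```python
from typing import Dict, Iterable


def boundary_prf(predicted: Iterable[int], reference: Iterable[int]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over chunk-start line indices."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: three of four predicted boundaries match the reference,
# and two of the five reference boundaries are missed.
scores = boundary_prf(predicted=[0, 4, 9, 15], reference=[0, 4, 10, 15, 22])
print(scores)
```

Exact matching is strict: the predicted boundary at line 9 gets no credit for being one line away from the reference boundary at line 10. A tolerance window would be a reasonable variant if near misses should count.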

Figures

Figures reproduced from arXiv: 2606.23697 by Boris Nazarov, Darya Frolova, Pavel Kisilev, Shaked Leibzirer.

**Figure 2.** Examples of semantic chunking for C code.

Original abstract

Semantic segmentation of code written in a C-family language remains a challenging problem, due to the language's complex syntax, macro expansion, and irregular structural patterns. Existing chunking methods, such as fixed-sized windows, heuristic splitting, and syntax-based tools, often fail to capture meaningful functional units, limiting the efficacy of retrieval and other downstream LLM-driven tasks. In this paper, we address the problem of chunking in C-related languages. First, we define a set of code chunk categories. Second, we train an LLM-based classifier to a) identify chunk boundaries, and b) assign each chunk a descriptive functional attribute (a category), which can be useful for downstream tasks. By leveraging the LLM's ability to capture semantic context within the code, we assume flexible chunk boundaries, allowing them to adapt to the specific structure and context of each instance. Third, we introduce SemChunk-C, a family of lightweight language models for semantic chunking of C-related files (.c, .cpp, .h, .cs, etc.). These models are based on the first four Ettin encoders [1] with 17M, 32M, 68M, and 150M parameters. Despite their relatively small size, they are capable of identifying cohesive code units, such as data structures, interface blocks, and other components. Furthermore, we demonstrate the robustness of our approach on real-world code, including challenging constructs such as nested definitions and macros. We test our approach on various datasets, and show that it achieves high boundary accuracy and semantic coherence, matching or outperforming chunkers that are based on much larger code-oriented LLMs. We also validate the improved performance of the downstream tasks on a few curated benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SemChunk-C defines chunk categories and distills small Ettin models for C semantic segmentation, but the abstract supplies no numbers or validation details to support the performance claims.

The new element here is the specific family of 17M–150M Ettin-based models trained for C-family code that both locate flexible boundaries and assign functional categories like data structures or interface blocks. The authors also lay out a short list of chunk categories and note that they first use a larger LLM to generate the labels before distillation.

That approach targets a real pain point: fixed windows and syntax parsers often break C code at unhelpful places because of macros and irregular nesting. The idea of letting the labeler adapt boundaries to context is reasonable on paper.
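The failure mode of fixed windows is easy to reproduce. The sketch below cuts a small, made-up C snippet into 5-line windows and flags windows whose braces do not balance. The snippet, the window size, and the brace-count check are all illustrative assumptions. The check is a crude heuristic rather than a parser, and braces inside strings, comments, or macros would fool it.

```python
from typing import List

# Made-up C source used only for illustration.
SOURCE = """\
#include <stdint.h>

#define MAX_ITEMS 16

typedef struct {
    uint32_t id;
    uint32_t count;
    uint8_t items[MAX_ITEMS];
} Inventory;

static int inventory_full(const Inventory *inv) {
    return inv->count >= MAX_ITEMS;
}
"""


def fixed_window_chunks(text: str, window: int) -> List[List[str]]:
    """Split text into consecutive groups of `window` lines (last may be shorter)."""
    lines = text.splitlines()
    return [lines[i:i + window] for i in range(0, len(lines), window)]


def brace_balance(lines: List[str]) -> int:
    """Opening braces minus closing braces; non-zero hints at a split definition."""
    joined = "\n".join(lines)
    return joined.count("{") - joined.count("}")


windows = fixed_window_chunks(SOURCE, window=5)
unbalanced = [i for i, w in enumerate(windows) if brace_balance(w) != 0]

for i, w in enumerate(windows):
    print(f"window {i}: balance={brace_balance(w)}")
print("windows that split a definition:", unbalanced)
```

Here the first window ends on `typedef struct {` and the second carries the struct's fields and closing brace, so neither holds the complete definition. A boundary placed before the `typedef` and after `} Inventory;` would keep it whole.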

The abstract states that the models achieve high boundary accuracy, semantic coherence, and match or beat much larger code LLMs on datasets and downstream tasks. None of those results appear—no accuracies, no dataset sizes, no training procedure, no error bars. The central claim therefore sits on unshown evidence.

The label-generation step receives no discussion of quality checks or inter-annotator agreement either. If the LLM labels were produced with dataset-specific prompting or post-hoc fixes, the reported gains for both the teacher and the distilled models become difficult to interpret or compare.
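As an illustration of the kind of check being asked for, agreement between LLM-generated labels and a human annotator could be summarized with Cohen's kappa over per-line boundary decisions. The sketch below uses invented labels. It does not describe any validation the authors performed, which this page does not report.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    labels = set(a) | set(b)
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented per-line labels: 1 = "a chunk starts on this line", 0 = otherwise.
llm_labels   = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
human_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

kappa = cohens_kappa(llm_labels, human_labels)
print(round(kappa, 3))
```

Raw agreement in this example is 80%, but most lines are non-boundaries, so chance agreement is already 58% and kappa is only moderate. That gap is why a chance-corrected statistic is more informative than raw agreement for sparse boundary labels.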

This work is aimed at people who build retrieval or context-assembly pipelines for code LLMs. A reader might borrow the category definitions, but the missing numbers and validation details make the paper hard to use or cite as is.

I would not send it to peer review in its current form. The authors need to add the quantitative results and address how the labels were validated before a referee could evaluate whether the small models actually deliver what is claimed.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces SemChunk-C, a family of small encoder-based models (17M–150M parameters, derived from the first four Ettin encoders) for semantic segmentation of C-family code (.c, .cpp, .h, .cs, etc.). It defines a set of functional chunk categories, trains an LLM classifier to detect flexible boundaries and assign categories, distills the capability into the lightweight models, and claims these models achieve high boundary accuracy and semantic coherence while matching or outperforming chunkers based on much larger code-oriented LLMs across various datasets and downstream tasks.

Significance. If the performance claims hold with proper empirical support, the work would offer a practical contribution to code retrieval and LLM-driven tasks by supplying deployable, low-parameter models that capture semantic units more effectively than fixed-window or heuristic baselines. The explicit use of functional categories is a constructive element that could aid structured downstream applications.

major comments (3)

[Abstract] Abstract: the central claim that SemChunk-C models 'achieve high boundary accuracy and semantic coherence, matching or outperforming chunkers that are based on much larger code-oriented LLMs' is stated without any accompanying numerical results, metrics, datasets, error bars, or baseline comparisons, rendering the headline performance assertion unevaluable.
[LLM classifier description] Section describing the LLM-based classifier: no details are supplied on label generation quality (e.g., validation against human annotations, inter-annotator agreement, or prompting/tuning procedures); because the distilled models' reported accuracies depend directly on the fidelity of these labels, the absence of this information prevents assessment of whether the comparisons to larger-LLM baselines are valid.
[Experimental evaluation] Experimental evaluation (implied in the abstract's reference to 'various datasets' and 'curated benchmarks'): the manuscript provides no information on training procedures, hyperparameters, dataset statistics, or statistical tests for either the classifier or the SemChunk-C models, which is load-bearing for the outperformance claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional empirical details are needed to support the claims. We address each point below and will revise the manuscript to incorporate the requested information.

Referee: [Abstract] Abstract: the central claim that SemChunk-C models 'achieve high boundary accuracy and semantic coherence, matching or outperforming chunkers that are based on much larger code-oriented LLMs' is stated without any accompanying numerical results, metrics, datasets, error bars, or baseline comparisons, rendering the headline performance assertion unevaluable.

Authors: We agree that the abstract would be strengthened by including concrete metrics. In the revised version we will add key numerical results (e.g., boundary F1 scores, semantic coherence metrics), the primary datasets, and direct comparisons to the larger-LLM baselines so that the performance claim can be evaluated directly from the abstract.

revision: yes

Referee: [LLM classifier description] Section describing the LLM-based classifier: no details are supplied on label generation quality (e.g., validation against human annotations, inter-annotator agreement, or prompting/tuning procedures); because the distilled models' reported accuracies depend directly on the fidelity of these labels, the absence of this information prevents assessment of whether the comparisons to larger-LLM baselines are valid.

Authors: We acknowledge that details on label quality are essential for assessing the validity of the distillation. We will expand the LLM-classifier section to describe the prompting and tuning procedures used, any human validation performed, and inter-annotator agreement statistics where available. If human validation was limited, we will explicitly note this and discuss its implications for the reported results.

revision: yes

Referee: [Experimental evaluation] Experimental evaluation (implied in the abstract's reference to 'various datasets' and 'curated benchmarks'): the manuscript provides no information on training procedures, hyperparameters, dataset statistics, or statistical tests for either the classifier or the SemChunk-C models, which is load-bearing for the outperformance claim.

Authors: We agree that the experimental section requires substantially more detail to support the outperformance claims. In the revision we will add full descriptions of training procedures and hyperparameters for both the LLM classifier and the SemChunk-C models, dataset statistics (sizes, splits, sources), and any statistical tests used to compare against baselines.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's chain consists of defining chunk categories, training an LLM-based classifier for boundaries and labels, distilling into Ettin-based encoders, and evaluating boundary accuracy on datasets against larger LLM baselines. No quoted step reduces a claimed prediction or result to its own fitted inputs by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The central performance claims rest on external dataset testing rather than tautological re-use of the same quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified assumption that the LLM classifier generalizes across real-world C constructs including macros and nesting; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

Reference graph

Works this paper leans on

18 extracted references · 10 canonical work pages · 3 internal anchors

[1] Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, and Benjamin Van Durme. Ettin: Analyzing encoders vs decoders using the same architecture and data. International Conference on Learning Representations (ICLR), 2025.

[2] Marti A. Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64, 1997.

[3] Alexander A. Alemi and Paul Ginsparg. Text segmentation based on semantic word embeddings. arXiv preprint arXiv:1503.05543, 2015.

[4] Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. Text segmentation as a supervised learning task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 469–473, 2018.

[5] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Dense X retrieval: What retrieval unit should we use? arXiv preprint arXiv:2312.06648, 2023.

[6] Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. Seven failure points when engineering a retrieval augmented generation system. arXiv preprint arXiv:2401.05856, 2024.

[7] Vasiliki Efstathiou and Diomidis Spinellis. Semantic source code models using identifier embeddings. IEEE Access, 7:129364–129377, 2019.

[8] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.

[9] Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, and Hui Li. CodeRAG: Finding relevant and necessary knowledge for retrieval-augmented repository-level code completion. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2025.

[10] Hamza El Atrassi, Yasmina El Idrissi, and Yahya Benkaouz. CodeWisp: AST guided retrieval augmented generation for code generation and completion. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2025.

[11] Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, and Tongshuang Wu. CAST: Enhancing code retrieval-augmented generation with structural chunking via abstract syntax tree. Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.

[12] Aryan Singhal, Rajat Ghosh, Ria Mundra, Harshil Dadlani, and Debojyoti Dutta. CODE2JSON: Can a zero-shot LLM extract code features for code RAG? International Conference on Learning Representations (ICLR), 2025.

[13] Pratik Shah, Rajat Ghosh, Aryan Singhal, and Debojyoti Dutta. RANGER: Repository-level agent for graph-enhanced retrieval. arXiv preprint arXiv:2509.25257, 2025.

[14] Sajal Halder, Muhammad Ejaz Ahmed, and Seyit Camtepe. FuncVul: An effective function level vulnerability detection model using LLM and code chunk. arXiv preprint arXiv:2506.19453, 2025.

[15] Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. LongCodeZip: Compress long context for code language models. arXiv preprint arXiv:2510.00446, 2025.

[16] Christopher Glasz, Emily Escamilla, Eric O. Scott, Anand Patel, Jacob Zimmer, Colin Diggs, Michael Doyle, Scott Rosen, Nitin Naik, Justin F. Brunelle, Samruddhi Thaker, Parthav Poudel, Arun Sridharan, Amit Madan, Doug Wendt, William Macke, and Thomas Schill. Can LLMs replace humans during code chunking? arXiv preprint arXiv:2506.19897, 2025. [17]https://hu...

[17] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023.

[18] Aidar Valeev, Roman Garaev, Vadim Lomshakov, Irina Piontkovskaya, Vladimir Ivanov, and Israel Adewuyi. YABLoCo: Yet another benchmark for long context code generation. arXiv preprint arXiv:2505.04406, 2025.
