pith. machine review for the scientific record.

arxiv: 2605.00199 · v2 · submitted 2026-04-30 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: unknown

RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords table reasoning · small language models · structured attribution · faithfulness · cell-level citations · NLI reward · GRPO optimization

The pith

Training small language models to output reasoning with built-in cell citations raises faithfulness 3.7 times over standard fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RSAT as a way to train small language models to answer table questions with step-by-step reasoning that includes explicit cell citations. It proceeds in two stages: first supervised fine-tuning on verified traces formatted as structured JSON, then optimization that rewards natural-language-inference checks for whether each step is actually supported by the cited cells. This built-in approach produces far higher faithfulness and citation validity than either plain fine-tuning or attempts to add citations after reasoning is complete. A reader would care because it points to a practical route for making model answers to data questions more verifiable without requiring much larger models.
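As a concrete illustration of the structured JSON traces described above, here is a hypothetical trace and a minimal format check. The field names (`steps`, `claim`, `cells`, `answer`) and the cell-id convention are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical RSAT-style reasoning trace with cell-level citations.
# Field names and cell ids (e.g. "R2C3") are assumptions for illustration.
trace = {
    "steps": [
        {"claim": "Revenue in 2023 was 4.2M", "cells": ["R2C3"]},
        {"claim": "Revenue in 2022 was 3.1M", "cells": ["R1C3"]},
        {"claim": "Revenue grew by 1.1M year over year", "cells": ["R1C3", "R2C3"]},
    ],
    "answer": "1.1M",
}

def is_valid_trace(obj: dict) -> bool:
    """Format check: an answer plus at least one step, where every step
    carries a non-empty claim and at least one cell citation."""
    steps = obj.get("steps", [])
    return (
        "answer" in obj
        and len(steps) > 0
        and all(step.get("claim") and step.get("cells") for step in steps)
    )

# The model would emit this as a JSON string; round-trip and validate it.
serialized = json.dumps(trace)
assert is_valid_trace(json.loads(serialized))
```

The "format success" metric in the figures below would correspond to a check of this kind: post-hoc attribution reportedly fails it most of the time, while RSAT passes almost always.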

Core claim

RSAT trains models to generate structured reasoning traces that cite specific table cells and optimizes them so that each reasoning step is faithful to the evidence in those cells. Across six models from the Qwen 2.5 and Llama 3 families, the method raises faithfulness from 0.224 to 0.826 while reaching 0.992 citation validity. Post-hoc attribution after ordinary fine-tuning succeeds in producing valid formats less than 13 percent of the time, and removing the faithfulness component of the reward drops faithfulness to 0.03.

What carries the argument

RSAT, a two-phase procedure that first uses supervised fine-tuning to teach structured JSON output containing cell citations, then applies GRPO to maximize a composite reward that includes NLI-based faithfulness, citation validity, and parsimony.
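A minimal sketch of how such a composite reward could be computed, assuming a simple weighted sum. The weights, the parsimony definition, and `stub_nli` are all invented placeholders; the paper's exact aggregation is not specified in this review.

```python
def composite_reward(steps, nli_score, table_cells,
                     w_faith=1.0, w_cite=0.5, w_pars=0.25):
    """Weighted sum of faithfulness, citation validity, and parsimony.
    nli_score(claim, evidence) -> entailment score in [0, 1]; it stands
    in for the NLI model used during GRPO. Weights are illustrative."""
    if not steps:
        return 0.0
    # Faithfulness: mean entailment of each step's claim by its cited cells.
    faith = sum(
        nli_score(s["claim"], [table_cells.get(c, "") for c in s["cells"]])
        for s in steps
    ) / len(steps)
    # Citation validity: fraction of cited cell ids that exist in the table.
    cited = [c for s in steps for c in s["cells"]]
    cite = sum(c in table_cells for c in cited) / max(len(cited), 1)
    # Parsimony (toy): reward roughly one citation per step, capped at 1.
    pars = min(1.0, len(steps) / max(len(cited), 1))
    return w_faith * faith + w_cite * cite + w_pars * pars

# Toy stand-in for an NLI model: "entailed" iff every evidence string
# appears verbatim in the claim. A real system would use an NLI classifier.
def stub_nli(claim, evidence):
    return 1.0 if evidence and all(e and e in claim for e in evidence) else 0.0

table = {"R1C1": "4.2", "R2C1": "3.1"}
step = [{"claim": "value is 4.2", "cells": ["R1C1"]}]
reward = composite_reward(step, stub_nli, table)
```

During GRPO, several candidate traces per question would be scored this way and the policy updated toward the higher-reward candidates.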

If this is right

  • Small models from multiple families can be made to produce table reasoning whose every step is traceable to specific evidence cells.
  • Attribution must be learned jointly with reasoning; retrofitting citations afterward fails to produce valid structured output.
  • The faithfulness reward is the dominant driver of gains; its removal collapses performance.
  • The same training pattern works for models ranging from 1B to 8B parameters without family-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integrated-attribution training could be tested on other tasks that require grounding in external data, such as spreadsheet formulas or database queries.
  • If the method generalizes, production table-question systems could expose the cited cells to users for direct verification instead of relying on model self-reports.
  • Future experiments could measure whether the structured format itself improves downstream accuracy on multi-hop table questions even when faithfulness rewards are held constant.

Load-bearing premise

The NLI-based faithfulness reward correctly identifies whether each reasoning step is genuinely supported by the cited table cells rather than merely overlapping with their text.
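The distinction matters because a shallower reward would not catch contradictions. A toy illustration of the failure mode this premise assumes the NLI reward avoids, using a naive lexical-overlap scorer (not what RSAT uses):

```python
def overlap_score(claim: str, evidence: str) -> float:
    """Naive lexical overlap: fraction of claim tokens found in the evidence."""
    claim_toks = claim.lower().split()
    ev_toks = set(evidence.lower().split())
    return sum(t in ev_toks for t in claim_toks) / len(claim_toks)

# The cited cell says revenue FELL; the reasoning step claims it ROSE.
evidence = "revenue fell to 3.1 million in 2023"
claim = "revenue rose to 3.1 million in 2023"

# Overlap stays high despite the contradiction; a genuine entailment check
# should score this step near zero. That gap is what the premise rests on.
high_overlap = overlap_score(claim, evidence)
```

If the NLI reward behaved like `overlap_score` on such cases, the reported faithfulness gains could be inflated, which is exactly what the human audit below would detect.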

What would settle it

A side-by-side human audit that checks whether every cited cell actually entails the corresponding reasoning step in RSAT outputs, and whether the same steps would hold without those exact citations, would falsify the claim if the human scores show no improvement over baselines.

Figures

Figures reproduced from arXiv: 2605.00199 by Jugal Gajjar, Kamalasankari Subramaniakuppusamy.

Figure 1. RSAT overview. Phase 1 (SFT) teaches the structured output format using verified gold traces. Phase 2 (GRPO) generates multiple candidates per question, scores them with a composite reward, and updates the policy. The faithfulness reward is the critical signal that grounds reasoning in actual table evidence.
Figure 2. Scaling curves for faithfulness and answer F1. RSAT (solid, high) dominates at every scale. Qwen outperforms Llama at small scale; the gap narrows at 7–8B. Post-hoc collapses regardless of scale.
Figure 3. Post-hoc attribution collapses across all models. RSAT maintains >97% format success; post-hoc averages 12.7% (dashed line).
Figure 5. Ablation study results from Section 5: the effect of removing each reward component on (a) faithfulness and (b) parsimony, for Qwen 7B and Llama 8B.
Figure 6. SFT training curves. Train and validation loss over 3 epochs for all 6 models. All models converge without overfitting.
Figure 7. GRPO reward curves. Composite reward mean increases for all models despite near-zero loss. Smoothed with a 15-step moving average; raw values shown in light shading.
Figure 8. Completion length reduction during GRPO. Early GRPO completions (SFT-like behavior) vs. late GRPO completions (RSAT behavior). All models learn to produce shorter, more focused outputs.
Original abstract

When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families, Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B), RSAT improves faithfulness 3.7× over SFT alone (0.224→0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RSAT, a two-phase training method for small language models (1-8B parameters from Qwen 2.5 and Llama 3 families) to produce step-by-step reasoning with cell-level citations for table question answering. Phase 1 applies SFT on verified reasoning traces to teach a structured JSON output format; Phase 2 uses GRPO to optimize a composite reward centered on NLI-based faithfulness, citation validity, and parsimony. The central claims are a 3.7× faithfulness improvement over SFT alone (0.224→0.826), near-perfect citation validity (0.992), failure of post-hoc attribution (<13% format success), and the necessity of the faithfulness reward (ablation drop from 0.97 to 0.03).

Significance. If the results hold, the work demonstrates that integrating structured attribution into the training process via SFT+GRPO can substantially improve faithfulness and interpretability for table reasoning in SLMs, outperforming post-hoc methods. The cross-family consistency and reward ablation provide useful evidence on training objectives for grounded generation, with potential applications in verifiable structured-data tasks.

major comments (3)
  1. The evaluation protocol is insufficiently specified: the abstract and results report faithfulness gains (0.224→0.826) and ablation values (0.97→0.03) without dataset details, error bars, full metrics definitions, or statistical tests. This undermines verification of the central claim of consistent 3.7× gains across six models.
  2. The NLI-based faithfulness reward (Phase 2) is load-bearing for the reported improvements, yet the manuscript provides no validation that it measures genuine cell-level evidence grounding rather than superficial textual overlap or plausible reasoning steps. Without targeted analysis or human evaluation of whether cited cells are actually consulted, the SFT+GRPO gains and post-hoc collapse could reflect format learning instead.
  3. Phase 1 relies on verified reasoning traces for SFT, but no details are given on the verification process, how cell-level accuracy was ensured, or checks for systematic biases in those traces. Any such bias would propagate through the GRPO stage and affect the faithfulness metric.
minor comments (2)
  1. The abstract states 'near-perfect citation validity (0.992)' without defining the precise criteria or computation method for citation validity.
  2. Clarify the apparent discrepancy between the main faithfulness score (0.826) and the ablation reference value (0.97); they may reflect different evaluation conditions or metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: The evaluation protocol is insufficiently specified: the abstract and results report faithfulness gains (0.224→0.826) and ablation values (0.97→0.03) without dataset details, error bars, full metrics definitions, or statistical tests. This undermines verification of the central claim of consistent 3.7× gains across six models.

    Authors: We agree that the evaluation protocol requires more detailed specification to allow full verification of our claims. In the revised manuscript, we will add comprehensive details on the datasets used, including their sources, sizes, and splits. We will provide precise mathematical definitions for all metrics, including the faithfulness score based on NLI. Additionally, we will report error bars (standard deviations over multiple seeds) and conduct statistical tests (e.g., Wilcoxon signed-rank tests) to confirm the significance of the 3.7× improvement across the six models from both families. These additions will be placed in the Experiments and Appendix sections. revision: yes

  2. Referee: The NLI-based faithfulness reward (Phase 2) is load-bearing for the reported improvements, yet the manuscript provides no validation that it measures genuine cell-level evidence grounding rather than superficial textual overlap or plausible reasoning steps. Without targeted analysis or human evaluation of whether cited cells are actually consulted, the SFT+GRPO gains and post-hoc collapse could reflect format learning instead.

    Authors: This is a valid concern regarding the validity of our NLI-based reward. While the reward combines NLI entailment with citation validity to encourage grounding, we did not include explicit validation against human judgments in the original submission. In the revision, we will incorporate a targeted analysis: we will sample outputs from the model and have them annotated by humans for whether the cited cells are indeed the ones used in the reasoning steps (as opposed to superficial matches). We will report the correlation between the automated NLI faithfulness score and human assessments. This will help demonstrate that the improvements are due to genuine attribution rather than mere format adherence. We will also discuss potential limitations of NLI in this context. revision: yes

  3. Referee: Phase 1 relies on verified reasoning traces for SFT, but no details are given on the verification process, how cell-level accuracy was ensured, or checks for systematic biases in those traces. Any such bias would propagate through the GRPO stage and affect the faithfulness metric.

    Authors: We acknowledge the need for transparency on the trace verification process. In the revised version, we will expand the Method section to describe in detail how the verified reasoning traces were created and validated. This includes: the protocol for ensuring cell-level accuracy (e.g., cross-checking each step against the table cells), the steps taken to avoid systematic biases (such as using diverse table domains and question types, and balancing positive/negative examples), and any automated or manual verification procedures employed. We will also discuss how this verification helps prevent bias propagation into the GRPO phase. revision: yes
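The correlation analysis promised in response 2 reduces to a simple statistic. A dependency-free sketch, where `nli_scores` and `human_labels` are invented placeholder data standing in for annotations of the sampled outputs:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between automated NLI faithfulness scores and
    human judgments of the same sampled reasoning steps."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: per-step NLI scores vs. binary human "supported" labels.
nli_scores = [0.95, 0.20, 0.88, 0.10, 0.75]
human_labels = [1, 0, 1, 0, 1]
r = pearson_r(nli_scores, human_labels)  # high r -> the reward tracks humans
```

A high correlation would support the load-bearing premise above; a low one would suggest the faithfulness gains reflect something the NLI reward measures other than genuine grounding.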

Circularity Check

0 steps flagged

No circularity: training pipeline and metrics rely on external components

Full rationale

The RSAT method consists of Phase-1 SFT on verified reasoning traces followed by Phase-2 GRPO optimization using a composite reward that includes an independent NLI model for faithfulness scoring against table content, plus citation validity and parsimony terms. All reported gains (e.g., 0.224 to 0.826 faithfulness) are measured against separate baselines such as plain SFT and post-hoc attribution methods; no result is obtained by fitting a parameter to a subset of the evaluation data and then relabeling it as a prediction. No equations are self-referential, no uniqueness theorems are imported from the authors' prior work, and no ansatz or renaming of known results occurs. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that NLI models provide a faithful proxy for reasoning grounding and that high-quality verified traces exist for SFT; no new physical entities are introduced.

free parameters (1)
  • composite reward weights
    The relative weighting of faithfulness, citation validity, and parsimony terms is not specified and must be chosen or tuned.
axioms (2)
  • domain assumption NLI models reliably detect whether reasoning steps are supported by cited table cells
    This is the core of the Phase 2 faithfulness reward.
  • domain assumption Verified reasoning traces used in SFT are accurate and unbiased
    Phase 1 depends on the quality of these traces.

pith-pipeline@v0.9.0 · 5509 in / 1300 out tokens · 54532 ms · 2026-05-09T20:23:42.085744+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1] TaPas: Weakly supervised table parsing via pre-training. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  2. [2] TAPEX: Table pre-training via learning a neural SQL executor. arXiv preprint arXiv:2107.07653.
  3. [3] Compositional semantic parsing on semi-structured tables. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  4. [4] Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875.
  5. [5] Chain-of-table: Evolving tables in the reasoning chain for table understanding. arXiv preprint arXiv:2401.04398, 2024.
  6. [6] Understanding tables with intermediate pre-training. Findings of the Association for Computational Linguistics: EMNLP 2020.
  7. [7] TaBERT: Pretraining for joint understanding of textual and tabular data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  8. [8] Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
  9. [9] TaPERA: Enhancing faithfulness and interpretability in long-form table QA by content planning and execution-based reasoning. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  10. [10] TabFact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164.
  11. [11] PASTA: Table-operations aware fact verification via sentence-table cloze pre-training. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  12. [12] Joint verification and reranking for open fact checking over tables. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  13. [13] Program enhanced fact verification with verbalization and graph attention network. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  14. [14] Self-RAG: Learning to retrieve, generate, and critique through self-reflection. The Twelfth International Conference on Learning Representations.
  15. [15] Enabling large language models to generate text with citations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  16. [16] RARR: Researching and revising what language models say, using language models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  17. [17] Measuring attribution in natural language generation models. Computational Linguistics.
  18. [18] Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
  19. [19] Correctness is not faithfulness in RAG attributions. arXiv preprint arXiv:2412.18004.
  20. [20] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  21. [21] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  22. [22] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
  23. [23] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
  24. [24] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.
  25. [25] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  26. [26] Reinforcement learning for reasoning in small LLMs: What works and what doesn't. arXiv preprint arXiv:2503.16219.
  27. [27] LoRA: Low-rank adaptation of large language models. ICLR.
  28. [28] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  29. [29] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  30. [30] BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  31. [31] FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 2022.
  32. [32] DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.