ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

Chenlang Yi; Gang Li; My T. Thai; Tianbao Yang; Tue Minh Cao; Yanmin Gong; Zizhan Xiong

arxiv: 2604.13392 · v2 · pith:3DKCT2U7new · submitted 2026-04-15 · 💻 cs.AI

Pith reviewed 2026-05-21 00:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords tabular data prediction · symbolic scaffolds · decision trees · LLM fine-tuning · faithful reasoning · hallucination metrics · explainable AI · medical and financial benchmarks

The pith

ReSS trains LLMs on decision-tree paths to gain up to 10 percent accuracy on tabular data while keeping reasoning faithful to the tree logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReSS extracts decision paths from a tree model for each data row and uses those paths as symbolic scaffolds. The scaffolds steer an LLM to write natural-language reasoning that follows the exact logic of the tree. The resulting dataset fine-tunes a pretrained LLM into a tabular reasoning model, with an added augmentation step that preserves the scaffold. New metrics measure hallucination rate, explanation necessity, and explanation sufficiency. Experiments on medical and financial tables show the trained models outperform both plain decision trees and standard fine-tuning by as much as 10 percent while producing consistent explanations.

Core claim

The central claim is that decision-tree paths can serve as reliable scaffolds to generate grounded natural-language reasoning from an LLM; the resulting dataset, when used for fine-tuning together with scaffold-invariant augmentation, produces models that achieve higher predictive accuracy on tabular tasks and satisfy quantitative faithfulness criteria defined by hallucination, necessity, and sufficiency scores.

What carries the argument

Instance-level decision paths extracted from a decision tree, used as symbolic scaffolds that constrain and ground the LLM's natural-language reasoning generation.
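
To make the idea of an instance-level decision path concrete, here is a minimal sketch in plain Python. The toy tree, the feature names (`glucose`, `bmi`), the thresholds, and the helpers `decision_path` and `render_scaffold` are all hypothetical illustrations, not the paper's implementation; the paper's actual tree model and prompt format are not shown on this page.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # followed when row[feature] <= threshold
    right: "Node"  # followed when row[feature] > threshold


Node = Union[Leaf, Split]


def decision_path(tree, row):
    """Walk the tree for one row and record every test that was applied."""
    steps = []
    node = tree
    while isinstance(node, Split):
        value = row[node.feature]
        go_left = value <= node.threshold
        steps.append((node.feature, "<=" if go_left else ">", node.threshold, value))
        node = node.left if go_left else node.right
    return steps, node.label


def render_scaffold(steps, label):
    """Turn the recorded path into text that could be placed in an LLM prompt."""
    lines = [f"{i}. {f} = {v} {op} {t}" for i, (f, op, t, v) in enumerate(steps, 1)]
    lines.append(f"Prediction: {label}")
    return "\n".join(lines)


# Hypothetical two-level tree; thresholds are made up for illustration.
TOY_TREE = Split(
    "glucose", 126.0,
    left=Split("bmi", 30.0, Leaf("non-diabetic"), Leaf("diabetic")),
    right=Leaf("diabetic"),
)

row = {"glucose": 110.0, "bmi": 33.5, "age": 52}
steps, label = decision_path(TOY_TREE, row)
scaffold = render_scaffold(steps, label)
print(scaffold)
```

Note that `age` never appears in the scaffold: only features tested along the path are exposed to the reasoning step, which is what lets a later check flag explanations that cite anything else.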

If this is right

ReSS-trained models deliver both higher accuracy and measurable faithfulness on tabular prediction in healthcare and finance.
The scaffold-guided dataset creation reduces the need for manual reasoning annotations while preserving logical consistency.
Scaffold-invariant augmentation improves generalization without altering the core decision structure.
The three quantitative metrics provide an objective way to audit whether explanations remain faithful to the underlying logic.
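
This page does not describe how the scaffold-invariant augmentation is carried out. One possible reading, sketched below purely as an assumption, is to perturb feature values and keep only the variants that still satisfy every condition on the original decision path. The helper names, the 5% jitter, and the example conditions are hypothetical.

```python
import random


def satisfies(row, scaffold):
    """True if the row still passes every (feature, op, threshold) test."""
    return all(
        (row[f] <= t) if op == "<=" else (row[f] > t)
        for f, op, t in scaffold
    )


def augment(row, scaffold, n, scale=0.05, seed=0, max_tries=1000):
    """Jitter each feature by up to +/- scale and keep path-preserving variants."""
    rng = random.Random(seed)
    variants = []
    tries = 0
    while len(variants) < n and tries < max_tries:
        tries += 1
        candidate = {k: v * (1 + rng.uniform(-scale, scale)) for k, v in row.items()}
        if satisfies(candidate, scaffold):  # reject anything that leaves the path
            variants.append(candidate)
    return variants


# Hypothetical instance and path conditions.
row = {"glucose": 110.0, "bmi": 33.5, "age": 52.0}
scaffold = [("glucose", "<=", 126.0), ("bmi", ">", 30.0)]
variants = augment(row, scaffold, n=5)
```

Under this reading, every variant can reuse the reasoning written for the original row, because the sequence of tree tests it passes is unchanged.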

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar scaffolding could be built from other symbolic structures such as rule sets or logic programs to handle a wider range of reasoning tasks.
The same pipeline might be applied to larger or streaming tabular datasets where manual explanation curation is impractical.
Hybrid systems could route simple cases to the tree and complex cases to the fine-tuned LLM while using the scaffold for consistency checks.

Load-bearing premise

The LLM-generated reasoning will strictly follow the decision logic in the scaffold without adding contradictions or extraneous inferences, and the hallucination, necessity, and sufficiency metrics will reliably detect any deviations.

What would settle it

A controlled test in which a substantial fraction of generated explanations on held-out data either contradict the corresponding tree path or receive low scores on the necessity or sufficiency metrics.

Figures

Figures reproduced from arXiv: 2604.13392 by Chenlang Yi, Gang Li, My T. Thai, Tianbao Yang, Tue Minh Cao, Yanmin Gong, Zizhan Xiong.

**Figure 1.** An illustration of the ReSS pipeline applied to the diabetes prediction problem.

**Figure 2.** Explanation sufficiency and necessity analysis for ReSS via feature masking across four tabular datasets, averaged over three random seeds. The x-axis denotes the number of masked features per instance, while the y-axis shows the resulting change in prediction accuracy under masking interventions.

**Figure 3.** An example of step-by-step reasoning curated by ReSS on the AD dataset.

**Figure 4.** An example of step-by-step reasoning curated by ReSS on the Creditg dataset.

**Figure 5.** An example of step-by-step reasoning curated by ReSS on the Diabetes dataset.

**Figure 6.** An example of step-by-step reasoning curated by ReSS on the Homeloan dataset.

**Figure 7.** An example of step-by-step reasoning obtained by direct reasoning curation on the Alzheimer’s Disease dataset.

**Figure 8.** An example of step-by-step reasoning obtained by direct reasoning curation on the Creditg dataset.

**Figure 9.** An example of step-by-step reasoning obtained by direct reasoning curation on the Diabetes dataset.

**Figure 10.** An example of step-by-step reasoning obtained by direct reasoning curation on the Homeloan dataset.

**Figure 11.** An example of delexicalized step-by-step reasoning curated by ReSS on the Diabetes dataset.

**Figure 12.** Along this decision path, the decision tree assigns a non-diabetic label, which reflects the empirical training label distribution observed in this localized region of the feature space but is wrong. However, every condition along the path corresponds to a well-established risk factor for diabetes according to the domain knowledge. In contrast, our fine-tuned LLM generates a rationale that faithfully foll…

**Figure 13.** Ablation study with delexicalized features, conducted without augmented reasoning data. Results are averaged over three random seeds.

**Figure 14.** Direct RL vs. ReSS (w/o aug.) + RL.

**Figure 16.** Sufficiency and necessity curves on Diabetes and AD. For ReSS, masking unused features results in only minor accuracy changes, while masking explanation-referenced features leads to a sharp and monotonic performance drop, indicating strong explanation necessity. In contrast, DRC+SFT consistently exhibits substantially weaker necessity. On Diabetes, masking features referenced by the explanation …

Original abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ReSS uses tree-derived paths as scaffolds to guide LLM reasoning generation for tabular data, then fine-tunes on the result. The reported gains and faithfulness, however, rest on metrics whose reliability is not yet clear from the details.

The main point is that ReSS extracts decision paths from a tree model and feeds them as scaffolds to an LLM so the generated natural-language reasoning stays tied to the tree logic, then uses that data plus scaffold-invariant augmentation to fine-tune a tabular reasoning model. The faithfulness metrics (hallucination rate, necessity, sufficiency) are meant to quantify how well this works on medical and financial benchmarks, with claims of up to 10% gains over plain trees and standard fine-tuning.

Referee Report

2 major / 1 minor

Summary. The paper presents ReSS, a framework for tabular data prediction that uses decision trees to generate symbolic scaffolds for guiding LLMs to produce natural-language reasoning. The scaffolds are used to create training data for fine-tuning LLMs, with an additional scaffold-invariant data augmentation. New metrics for faithfulness are proposed, and the method is claimed to achieve up to 10% better performance on medical and financial benchmarks compared to decision trees and standard fine-tuning while maintaining faithful reasoning.

Significance. This work addresses an important problem in explainable AI for tabular data in high-stakes applications. If the results hold, it offers a promising way to integrate symbolic reasoning with neural models for better accuracy and interpretability. The quantitative faithfulness metrics are a notable contribution for evaluating such hybrid systems.

major comments (2)

The abstract states that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches up to 10%, but does not specify the exact experimental protocol, baseline details, statistical tests, or ablation results. This leaves the central performance and faithfulness claims without verifiable support.
The hallucination rate, explanation necessity, and explanation sufficiency metrics are introduced to measure adherence to the decision logic in the scaffolds. However, without an independent verification such as human evaluation to confirm they capture deviations or contradictions, the metrics risk being circular and not guaranteeing the strict adherence assumed in the central claim.

minor comments (1)

Consider adding the names of the specific medical and financial benchmarks used in the experiments for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the major comments point by point below, providing clarifications and indicating planned revisions to enhance the manuscript's transparency and rigor.

Referee: The abstract states that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches up to 10%, but does not specify the exact experimental protocol, baseline details, statistical tests, or ablation results. This leaves the central performance and faithfulness claims without verifiable support.

Authors: The abstract is intentionally concise to highlight the main contributions. Detailed descriptions of the experimental protocol, including the specific medical and financial benchmarks, baseline models (decision trees and vanilla LLM fine-tuning), statistical significance testing, and ablation studies on scaffold and augmentation components, are provided in Section 4 of the manuscript with results in the tables. We will update the abstract to include a brief reference to the evaluation setup and cross-references to the experimental section to improve immediate verifiability. revision: yes
Referee: The hallucination rate, explanation necessity, and explanation sufficiency metrics are introduced to measure adherence to the decision logic in the scaffolds. However, without an independent verification such as human evaluation to confirm they capture deviations or contradictions, the metrics risk being circular and not guaranteeing the strict adherence assumed in the central claim.

Authors: We appreciate the concern regarding potential circularity. The metrics are defined through direct, objective computations: hallucination rate detects logical contradictions via entailment against the scaffold, necessity quantifies performance drop upon explanation removal, and sufficiency checks predictive power of the explanation in isolation. These are rule-based and independent of the LLM generation process itself. To further strengthen validation, we will incorporate a human evaluation on a sample subset in the revised manuscript to demonstrate correlation with the automated scores. revision: yes
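
The three metrics described in the authors' response, together with the feature-masking analysis in the Figure 2 caption, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the paper's exact definitions are not given on this page, the hallucination check is simplified from entailment to a feature-set comparison, and `toy_predict`, `MASK`, and the sample rows are hypothetical.

```python
MASK = None  # hypothetical marker for a masked feature value


def mask(row, features):
    """Return a copy of the row with the given features hidden."""
    return {k: (MASK if k in features else v) for k, v in row.items()}


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def necessity_drop(predict, rows, labels, referenced):
    """Accuracy lost when the features cited by each explanation are masked."""
    masked = [mask(r, ref) for r, ref in zip(rows, referenced)]
    return accuracy(predict, rows, labels) - accuracy(predict, masked, labels)


def sufficiency_drop(predict, rows, labels, referenced):
    """Accuracy lost when only the cited features are kept (all others masked)."""
    masked = [mask(r, set(r) - ref) for r, ref in zip(rows, referenced)]
    return accuracy(predict, rows, labels) - accuracy(predict, masked, labels)


def hallucination_rate(referenced, scaffold_features):
    """Share of explanations that cite a feature absent from their scaffold."""
    bad = sum(1 for ref, sc in zip(referenced, scaffold_features) if not ref <= sc)
    return bad / len(referenced)


def toy_predict(row):
    """Stand-in for the fine-tuned model: thresholds glucose, ignores age."""
    g = row.get("glucose")
    return "diabetic" if g is not None and g > 126 else "non-diabetic"


rows = [{"glucose": 150, "age": 40}, {"glucose": 100, "age": 60}]
labels = ["diabetic", "non-diabetic"]
referenced = [{"glucose"}, {"glucose"}]

nec = necessity_drop(toy_predict, rows, labels, referenced)
suf = sufficiency_drop(toy_predict, rows, labels, referenced)
hal = hallucination_rate([{"glucose"}, {"glucose", "age"}], [{"glucose"}, {"glucose"}])
```

In this toy setup a large `nec` and a small `suf` are the desirable pattern: hiding the cited features hurts, while hiding everything else does not. Note that these scores are computed from the model's own predictions, which is why the referee asks for an independent human check.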

Circularity Check

0 steps flagged

Independent decision-tree scaffolds generated prior to LLM involvement keep derivation self-contained

The paper first fits a decision-tree model to extract instance-level decision paths as symbolic scaffolds from the tabular data. These scaffolds are then used to prompt an LLM for natural-language reasoning generation, followed by fine-tuning and evaluation via hallucination rate, necessity, and sufficiency metrics. Because the tree-derived scaffolds are produced by an independent symbolic model before any LLM step and the metrics operate on the generated outputs rather than re-using the same fitted parameters, no load-bearing step reduces by construction to its own inputs. The central claims rest on experimental comparisons rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that decision paths extracted from a trained tree constitute a complete and faithful symbolic representation of the predictive logic that an LLM can be made to follow without semantic loss.

axioms (1)

domain assumption: A decision-tree model trained on tabular data yields instance-level decision paths that accurately encode the logic used to reach each prediction.
This premise is invoked when the paths are extracted and supplied to the LLM as scaffolds.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

We introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

[2] Abutbul, A., Elidan, G., Katzir, L., and El-Yaniv, R. Dnf-net: A neural architecture for tabular data, 2020. URL https://arxiv.org/abs/2006.06465
[3] Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A. Chain-of-thought reasoning in the wild is not always faithful, 2025. URL https://arxiv.org/abs/2503.08679
[4] Arik, S. O. and Pfister, T. Tabnet: Attentive interpretable tabular learning, 2020. URL https://arxiv.org/abs/1908.07442
[5] Atanasova, P., Camburu, O.-M., Lioma, C., Lukasiewicz, T., Simonsen, J. G., and Augenstein, I. Faithfulness tests for natural language explanations. arXiv preprint arXiv:2305.18029, 2023
[6] Barez, F., Wu, T.-Y., Arcuschin, I., Lan, M., Wang, V., Siegel, N., Collignon, N., Neo, C., Lee, I., Paren, A., et al. Chain-of-thought is not explainability. Preprint, alphaXiv, v1, 2025
[7] Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745
[8] Breiman, L. Random forests. Mach. Learn., 45(1):5–32, October 2001. ISSN 0885-6125. doi:10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324
[9] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8
[10] Cai, P., Gao, Z., and Chen, J. Tabr1: Taming grpo for tabular reasoning llms. arXiv preprint arXiv:2510.17385, 2025
[11] Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 785–794. ACM, August 2016. doi:10.1145/2939672.2939785. URL http://dx.doi.org/10.1145/2939672.2939785
[12] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts,... Scaling Instruction-Finetuned Language Models, 2022
[13] Clore, J., Cios, K., DeShazo, J., and Strack, B. Diabetes 130-US Hospitals for Years 1999-2008. UCI Machine Learning Repository, 2014. DOI: 10.24432/C5230J
[14] Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-Y., Papailiopoulos, D., and Lee, K. Lift: Language-interfaced fine-tuning for non-language machine learning tasks, 2022. URL https://arxiv.org/abs/2206.06565
[15] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
[16] Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. Tabllm: Few-shot classification of tabular data with large language models, 2023. URL https://arxiv.org/abs/2210.10723
[17] Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. Tabpfn: A transformer that solves small tabular classification problems in a second, 2023. URL https://arxiv.org/abs/2207.01848
[18] Hu, J., Liu, J. K., Xu, H., and Shen, W. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262
[19] Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. Tabtransformer: Tabular data modeling using contextual embeddings, 2020. URL https://arxiv.org/abs/2012.06678
[20] Kambhampati, S., Stechly, K., and Valmeekam, K. (How) do reasoning models reason? Annals of the New York Academy of Sciences, 1547(1):33–40, 2025. doi:10.1111/nyas.15339. URL https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.15339
[21] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://...
[22] Kwon, T., Ong, K. T.-i., Kang, D., Moon, S., Lee, J. R., Hwang, D., Sim, Y., Sohn, B., Lee, D., and Yeo, J. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales, 2024. URL https://arxiv.org/abs/2312.07399
[23] Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann
[24] Li, G., Lin, M., Galanti, T., Tu, Z., and Yang, T. Disco: Reinforcing large reasoning models with discriminative constrained optimization. arXiv preprint arXiv:2505.12366, 2025
[25] Moro, S., Rita, P., and Cortez, P. Bank Marketing. UCI Machine Learning Repository, 2014. DOI: 10.24432/C5K306
[26] Paul, D., West, R., Bosselut, A., and Faltings, B. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. arXiv preprint arXiv:2402.13950, 2024
[27] Si, J., Cheng, W. Y., Cooper, M., and Krishnan, R. G. Interpretabnet: Distilling predictive signals from tabular data by salient feature interpretation. arXiv preprint arXiv:2406.00426, 2024
[28] Slack, D. and Singh, S. Tablet: Learning from instructions for tabular data, 2023. URL https://arxiv.org/abs/2304.13188
[29] Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
[30] Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023
[31] von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., and Gallouédec, Q. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020
[32] Vygotsky, L. S. Mind in Society: Development of Higher Psychological Processes. Harvard University Press, 14th edition, March 1978. ISBN 0674576292. URL http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0674576292
[33] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J
[34] Werling, M., Seedat, N., Liu, J., Grønlykke, L., Niemann, C. U., van der Schaar, M., and Agius, R. Tables2traces: Distilling tabular data to improve llm reasoning in healthcare. In EurIPS 2025 Workshop: AI for Tabular Data, 2025
[35] Wies, N., Levine, Y., and Shashua, A. Sub-task decomposition enables learning in sequence to sequence tasks. In International Conference on Learning Representations, 2023. URL https://openreview.net/pdf?id=BrJATVZDWEH
[36] Xu, T., Zhang, Z., Sun, X., Zung, L. K., Hajimirsadeghi, H., and Mori, G. Tabreason: A reinforcement learning-enhanced reasoning llm for explainable tabular data prediction. arXiv preprint arXiv:2505.21807, 2025

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Dnf-net: A neural architecture for tabular data, 2020

Abutbul, A., Elidan, G., Katzir, L., and El-Yaniv, R. Dnf-net: A neural architecture for tabular data, 2020. URL https://arxiv.org/abs/2006.06465

work page arXiv 2020

[3] [3]

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679, 2025

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A. Chain-of-thought reasoning in the wild is not always faithful, 2025. URL https://arxiv. org/abs/2503.08679, 2025

work page arXiv 2025

[4] [4]

Arik, S. O. and Pfister, T. Tabnet: Attentive interpretable tabular learning, 2020. URL https://arxiv.org/abs/1908.07442

work page arXiv 2020

[5] [5]

G., and Augenstein, I

Atanasova, P., Camburu, O.-M., Lioma, C., Lukasiewicz, T., Simonsen, J. G., and Augenstein, I. Faithfulness tests for natural language explanations. arXiv preprint arXiv:2305.18029, 2023

work page arXiv 2023

[6] [6]

Chain-of-thought is not explainability

Barez, F., Wu, T.-Y., Arcuschin, I., Lan, M., Wang, V., Siegel, N., Collignon, N., Neo, C., Lee, I., Paren, A., et al. Chain-of-thought is not explainability. Preprint, alphaXiv, pp.\ v1, 2025

work page 2025

[7] [7]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

Machine Learning 45(1), 5–32 (Oct 2001)

Breiman, L. Random forests. Mach. Learn., 45 0 (1): 0 5–32, October 2001. ISSN 0885-6125. doi:10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324

work page doi:10.1023/a:1010933404324 2001

[9] [9]

H., Olshen, R

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8

work page 1984

[10] [10]

Tabr1: Taming grpo for tabular reasoning llms

Cai, P., Gao, Z., and Chen, J. Tabr1: Taming grpo for tabular reasoning llms. arXiv preprint arXiv:2510.17385, 2025

work page arXiv 2025

[11] [11]

Chen and C

Chen, Tianqi, Guestrin, and Carlos. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp.\ 785–794. ACM, August 2016. doi:10.1145/2939672.2939785. URL http://dx.doi.org/10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016

[12] [12]

Scaling Instruction-Finetuned Language Models

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Diabetes 130-US Hospitals for Years 1999-2008

Clore, J., Cios, K., DeShazo, J., and Strack, B. Diabetes 130-US Hospitals for Years 1999-2008. UCI Machine Learning Repository, 2014. DOI: 10.24432/C5230J

work page doi:10.24432/c5230j 2014

[14] [14]

LIFT: Language-interfaced fine-tuning for non-language machine learning tasks

Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-y., Papailiopoulos, D., and Lee, K. LIFT: Language-interfaced fine-tuning for non-language machine learning tasks, 2022. URL https://arxiv.org/abs/2206.06565

work page arXiv 2022

[15] [15]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

TabLLM: Few-shot classification of tabular data with large language models

Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. TabLLM: Few-shot classification of tabular data with large language models, 2023. URL https://arxiv.org/abs/2210.10723

work page arXiv 2023

[17] [17]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. TabPFN: A transformer that solves small tabular classification problems in a second, 2023. URL https://arxiv.org/abs/2207.01848

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Hu, J., Liu, J. K., Xu, H., and Shen, W. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. TabTransformer: Tabular data modeling using contextual embeddings, 2020. URL https://arxiv.org/abs/2012.06678

work page internal anchor Pith review Pith/arXiv arXiv 2020

[20] [20]

(How) do reasoning models reason?

Kambhampati, S., Stechly, K., and Valmeekam, K. (How) do reasoning models reason? Annals of the New York Academy of Sciences, 1547(1): 33–40, 2025. doi:10.1111/nyas.15339. URL https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.15339

work page doi:10.1111/nyas.15339 2025

[21] [21]

LightGBM: A highly efficient gradient boosting decision tree

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://...

work page 2017

[22] [22]

Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales

Kwon, T., Ong, K. T.-i., Kang, D., Moon, S., Lee, J. R., Hwang, D., Sim, Y., Sohn, B., Lee, D., and Yeo, J. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales, 2024. URL https://arxiv.org/abs/2312.07399

work page arXiv 2024

[23] [23]

Crafting papers on machine learning

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann

work page 2000

[24] [24]

DisCO: Reinforcing large reasoning models with discriminative constrained optimization

Li, G., Lin, M., Galanti, T., Tu, Z., and Yang, T. DisCO: Reinforcing large reasoning models with discriminative constrained optimization. arXiv preprint arXiv:2505.12366, 2025

work page arXiv 2025

[25] [25]

Bank Marketing

Moro, S., Rita, P., and Cortez, P. Bank Marketing. UCI Machine Learning Repository, 2014. DOI: 10.24432/C5K306

work page doi:10.24432/c5k306 2014

[26] [26]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

Paul, D., West, R., Bosselut, A., and Faltings, B. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. arXiv preprint arXiv:2402.13950, 2024

work page arXiv 2024

[27] [27]

InterpreTabNet: Distilling predictive signals from tabular data by salient feature interpretation

Si, J., Cheng, W. Y., Cooper, M., and Krishnan, R. G. InterpreTabNet: Distilling predictive signals from tabular data by salient feature interpretation. arXiv preprint arXiv:2406.00426, 2024

work page arXiv 2024

[28] [28]

TABLET: Learning from instructions for tabular data

Slack, D. and Singh, S. TABLET: Learning from instructions for tabular data, 2023. URL https://arxiv.org/abs/2304.13188

work page arXiv 2023

[29] [29]

Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting

Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36: 74952–74965, 2023

work page 2023

[31] [31]

Trl: Transformer reinforcement learning

von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., and Gallouédec, Q. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

work page 2020

[32] [32]

Vygotsky, L. S. Mind in Society: Development of Higher Psychological Processes. Harvard University Press, 14th edition, March 1978. ISBN 0674576292. URL http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0674576292

work page 1978

[33] [33]

Chain of thought prompting elicits reasoning in large language models

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J

work page 2022

[34] [34]

Tables2Traces: Distilling tabular data to improve LLM reasoning in healthcare

Werling, M., Seedat, N., Liu, J., Grønlykke, L., Niemann, C. U., van der Schaar, M., and Agius, R. Tables2Traces: Distilling tabular data to improve LLM reasoning in healthcare. In EurIPS 2025 Workshop: AI for Tabular Data, 2025

work page 2025

[35] [35]

Sub-task decomposition enables learning in sequence to sequence tasks

Wies, N., Levine, Y., and Shashua, A. Sub-task decomposition enables learning in sequence to sequence tasks. In International Conference on Learning Representations, 2023. URL https://openreview.net/pdf?id=BrJATVZDWEH

work page 2023

[36] [36]

TabReason: A reinforcement learning-enhanced reasoning LLM for explainable tabular data prediction

Xu, T., Zhang, Z., Sun, X., Zung, L. K., Hajimirsadeghi, H., and Mori, G. TabReason: A reinforcement learning-enhanced reasoning LLM for explainable tabular data prediction. arXiv preprint arXiv:2505.21807, 2025

work page arXiv 2025