arxiv: 2604.13392 · v2 · pith:3DKCT2U7new · submitted 2026-04-15 · 💻 cs.AI

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

Pith reviewed 2026-05-21 00:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords tabular data prediction · symbolic scaffolds · decision trees · LLM fine-tuning · faithful reasoning · hallucination metrics · explainable AI · medical and financial benchmarks

The pith

ReSS trains LLMs on decision-tree paths to gain up to 10 percent accuracy on tabular data while keeping reasoning faithful to the tree logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReSS extracts decision paths from a tree model for each data row and uses those paths as symbolic scaffolds. The scaffolds steer an LLM to write natural-language reasoning that follows the exact logic of the tree. The resulting dataset fine-tunes a pretrained LLM into a tabular reasoning model, with an added augmentation step that preserves the scaffold. New metrics measure hallucination rate, explanation necessity, and explanation sufficiency. Experiments on medical and financial tables show the trained models outperform both plain decision trees and standard fine-tuning by as much as 10 percent while producing consistent explanations.
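
The path-extraction step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` class, the `extract_scaffold` helper, and the toy tree with its features and thresholds are hypothetical and are not taken from the paper, which would use a tree fitted on real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None   # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when value <= threshold
    right: Optional["Node"] = None  # branch taken when value > threshold
    label: Optional[str] = None     # prediction stored at a leaf


def extract_scaffold(root, row):
    """Walk from root to leaf, recording each test the row satisfies."""
    conditions = []
    node = root
    while node.feature is not None:
        value = row[node.feature]
        if value <= node.threshold:
            conditions.append(f"{node.feature} <= {node.threshold}")
            node = node.left
        else:
            conditions.append(f"{node.feature} > {node.threshold}")
            node = node.right
    return conditions, node.label


# Toy tree with made-up features and thresholds (illustration only).
toy_tree = Node(
    "glucose", 140.0,
    left=Node("bmi", 30.0,
              left=Node(label="non-diabetic"),
              right=Node(label="diabetic")),
    right=Node(label="diabetic"),
)

row = {"glucose": 120.0, "bmi": 33.5}
conditions, label = extract_scaffold(toy_tree, row)
print(" AND ".join(conditions), "=>", label)
# glucose <= 140.0 AND bmi > 30.0 => diabetic
```

The returned list of conditions is what this page calls a scaffold: an ordered, instance-specific record of the tests the tree applied to one row.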

Core claim

The central claim is that decision-tree paths can serve as reliable scaffolds to generate grounded natural-language reasoning from an LLM; the resulting dataset, when used for fine-tuning together with scaffold-invariant augmentation, produces models that achieve higher predictive accuracy on tabular tasks and satisfy quantitative faithfulness criteria defined by hallucination, necessity, and sufficiency scores.

What carries the argument

Instance-level decision paths extracted from a decision tree, used as symbolic scaffolds that constrain and ground the LLM's natural-language reasoning generation.
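
As a rough sketch of how a scaffold might constrain generation, the snippet below assembles a curation prompt from the input features, the decision path, and the label, which are the three inputs the abstract names. The function name `build_curation_prompt` and the prompt wording are hypothetical; the paper's actual prompt is not shown on this page, and no LLM call is made here.

```python
def build_curation_prompt(features, scaffold, label):
    """Assemble a prompt asking an LLM to verbalize a decision path."""
    feature_lines = "\n".join(
        f"- {name}: {value}" for name, value in features.items()
    )
    step_lines = "\n".join(
        f"{i}. {cond}" for i, cond in enumerate(scaffold, start=1)
    )
    return (
        "Record:\n"
        f"{feature_lines}\n\n"
        "Decision path (follow these steps in order, add no other features):\n"
        f"{step_lines}\n\n"
        f"Final label: {label}\n\n"
        "Write step-by-step reasoning that uses only the conditions above "
        "and ends with the final label."
    )


# Made-up record and path, for illustration only.
prompt = build_curation_prompt(
    features={"glucose": 120.0, "bmi": 33.5, "age": 47},
    scaffold=["glucose <= 140.0", "bmi > 30.0"],
    label="diabetic",
)
print(prompt)
```

Listing the path as numbered steps, separately from the full record, makes it easy to check afterwards whether a generated rationale mentions a feature (here `age`) that the path never tested.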

If this is right

  • ReSS-trained models deliver both higher accuracy and measurable faithfulness on tabular prediction in healthcare and finance.
  • The scaffold-guided dataset creation reduces the need for manual reasoning annotations while preserving logical consistency.
  • Scaffold-invariant augmentation improves generalization without altering the core decision structure.
  • The three quantitative metrics provide an objective way to audit whether explanations remain faithful to the underlying logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar scaffolding could be built from other symbolic structures such as rule sets or logic programs to handle a wider range of reasoning tasks.
  • The same pipeline might be applied to larger or streaming tabular datasets where manual explanation curation is impractical.
  • Hybrid systems could route simple cases to the tree and complex cases to the fine-tuned LLM while using the scaffold for consistency checks.

Load-bearing premise

The LLM-generated reasoning strictly follows the decision logic in the scaffold, without adding contradictions or extraneous inferences, and the hallucination, necessity, and sufficiency metrics reliably detect any such deviations.

What would settle it

A controlled test in which a substantial fraction of generated explanations on held-out data either contradict the corresponding tree path or receive low scores on the necessity or sufficiency metrics.

Figures

Figures reproduced from arXiv: 2604.13392 by Chenlang Yi, Gang Li, My T. Thai, Tianbao Yang, Tue Minh Cao, Yanmin Gong, Zizhan Xiong.

Figure 1
Figure 1: An illustration of the ReSS pipeline applied to the diabetes prediction problem.
From Section 3.2 (Using Decision Tree Paths as Symbolic Scaffolds), Motivation of Using Symbolic Scaffolds: A consequence of the direct curation approach discussed above is that the generated rationale may contain many non-useful features.
Figure 2
Figure 2: Explanation sufficiency and necessity analysis for ReSS via feature masking across four tabular datasets, averaged over three random seeds. The x-axis denotes the number of masked features per instance, while the y-axis shows the resulting change in prediction accuracy under masking interventions.
Figure 3
Figure 3: An example of step-by-step reasoning curated by ReSS on AD dataset.
Figure 4
Figure 4: An example of step-by-step reasoning curated by ReSS on Creditg dataset.
Figure 5
Figure 5: An example of step-by-step reasoning curated by ReSS on Diabetes dataset.
Figure 6
Figure 6: An example of step-by-step reasoning curated by ReSS on Homeloan dataset.
Figure 7
Figure 7: An example of step-by-step reasoning obtained by direct reasoning curation on Alzheimer’s Disease dataset.
Figure 8
Figure 8: An example of step-by-step reasoning obtained by direct reasoning curation on Creditg dataset.
Figure 9
Figure 9: An example of step-by-step reasoning obtained by direct reasoning curation on Diabetes dataset.
Figure 10
Figure 10: An example of step-by-step reasoning obtained by direct reasoning curation on Homeloan dataset.
Figure 11
Figure 11: An example of delexicalized step-by-step reasoning curated by ReSS on the Diabetes dataset.
From Section C.1 (Hyperparameters, Decision Tree): For the Decision Tree baseline, we perform grid search over the following hyperparameter space:
  • max depth ∈ {4, 5, 6, 7}
  • min samples split ∈ {2, 5, 10, 20}
  • min samples leaf ∈ {1, 2, 5, 10}
  • criterion ∈ {gini, entropy}
The optimal hyperparameters are selected based on val…
Figure 12
Figure 12: Along this decision path, the decision tree assigns a non-diabetic label, which reflects the empirical training label distribution observed in this localized region of the feature space but is wrong. However, every condition along the path corresponds to a well-established risk factor for diabetes according to the domain knowledge. In contrast, our fine-tuned LLM generates a rationale that faithfully foll…
Figure 13
Figure 13: Ablation study with delexicalized features, conducted without augmented reasoning data. Results are averaged over three random seeds.
Figure 14
Figure 14: Direct RL vs. ReSS (w/o aug.) + RL
Figure 16
Figure 16 shows the sufficiency and necessity curves on Diabetes and AD. For ReSS, masking unused features results in only minor accuracy changes, while masking explanation-referenced features leads to a sharp and monotonic performance drop, indicating strong explanation necessity. In contrast, DRC+SFT consistently exhibits substantially weaker necessity. On Diabetes, masking features referenced by the explanation …
Original abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents ReSS, a framework for tabular data prediction that uses decision trees to generate symbolic scaffolds for guiding LLMs to produce natural-language reasoning. The scaffolds are used to create training data for fine-tuning LLMs, with an additional scaffold-invariant data augmentation. New metrics for faithfulness are proposed, and the method is claimed to achieve up to 10% better performance on medical and financial benchmarks compared to decision trees and standard fine-tuning while maintaining faithful reasoning.

Significance. This work addresses an important problem in explainable AI for tabular data in high-stakes applications. If the results hold, it offers a promising way to integrate symbolic reasoning with neural models for better accuracy and interpretability. The quantitative faithfulness metrics are a notable contribution for evaluating such hybrid systems.

major comments (2)
  1. The abstract states that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches up to 10%, but does not specify the exact experimental protocol, baseline details, statistical tests, or ablation results. This leaves the central performance and faithfulness claims without verifiable support.
  2. The hallucination rate, explanation necessity, and explanation sufficiency metrics are introduced to measure adherence to the decision logic in the scaffolds. However, without an independent verification such as human evaluation to confirm they capture deviations or contradictions, the metrics risk being circular and not guaranteeing the strict adherence assumed in the central claim.
minor comments (1)
  1. Consider adding the names of the specific medical and financial benchmarks used in the experiments for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the major comments point by point below, providing clarifications and indicating planned revisions to enhance the manuscript's transparency and rigor.

Point-by-point responses
  1. Referee: The abstract states that ReSS-trained models improve upon traditional decision trees and standard fine-tuning approaches up to 10%, but does not specify the exact experimental protocol, baseline details, statistical tests, or ablation results. This leaves the central performance and faithfulness claims without verifiable support.

    Authors: The abstract is intentionally concise to highlight the main contributions. Detailed descriptions of the experimental protocol, including the specific medical and financial benchmarks, baseline models (decision trees and vanilla LLM fine-tuning), statistical significance testing, and ablation studies on scaffold and augmentation components, are provided in Section 4 of the manuscript with results in the tables. We will update the abstract to include a brief reference to the evaluation setup and cross-references to the experimental section to improve immediate verifiability. revision: yes

  2. Referee: The hallucination rate, explanation necessity, and explanation sufficiency metrics are introduced to measure adherence to the decision logic in the scaffolds. However, without an independent verification such as human evaluation to confirm they capture deviations or contradictions, the metrics risk being circular and not guaranteeing the strict adherence assumed in the central claim.

    Authors: We appreciate the concern regarding potential circularity. The metrics are defined through direct, objective computations: hallucination rate detects logical contradictions via entailment against the scaffold, necessity quantifies performance drop upon explanation removal, and sufficiency checks predictive power of the explanation in isolation. These are rule-based and independent of the LLM generation process itself. To further strengthen validation, we will incorporate a human evaluation on a sample subset in the revised manuscript to demonstrate correlation with the automated scores. revision: yes
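
The masking idea behind the necessity and sufficiency checks can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's metric definitions: `masking_scores`, the neutral `fill` values, the toy predictor, and the toy rows are all hypothetical. It measures the accuracy drop when the features an explanation references are masked, and the drop when only the unreferenced features are masked.

```python
def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def mask(row, features, fill):
    """Copy a row, replacing the given features with a neutral fill value."""
    return {k: (fill[k] if k in features else v) for k, v in row.items()}


def masking_scores(predict, rows, labels, referenced, fill):
    """Return (necessity_drop, sufficiency_drop) in accuracy.

    necessity_drop: accuracy lost when explanation-referenced features are masked.
    sufficiency_drop: accuracy lost when only the unreferenced features are masked.
    """
    base = accuracy(predict, rows, labels)
    without_ref = [mask(r, set(ref), fill) for r, ref in zip(rows, referenced)]
    only_ref = [mask(r, set(r) - set(ref), fill) for r, ref in zip(rows, referenced)]
    necessity_drop = base - accuracy(predict, without_ref, labels)
    sufficiency_drop = base - accuracy(predict, only_ref, labels)
    return necessity_drop, sufficiency_drop


# Toy predictor and data (made up): the label depends only on glucose.
def predict(row):
    return "diabetic" if row["glucose"] > 140 else "non-diabetic"


rows = [
    {"glucose": 160, "age": 50},
    {"glucose": 110, "age": 30},
    {"glucose": 150, "age": 61},
    {"glucose": 100, "age": 45},
]
labels = ["diabetic", "non-diabetic", "diabetic", "non-diabetic"]
referenced = [{"glucose"}] * 4        # features each explanation cites
fill = {"glucose": 120, "age": 45}    # assumed neutral values, e.g. training means

necessity_drop, sufficiency_drop = masking_scores(predict, rows, labels, referenced, fill)
print(necessity_drop, sufficiency_drop)
# 0.5 0.0
```

A large necessity drop together with a near-zero sufficiency drop is the pattern the Figure 16 text on this page attributes to ReSS. Whether such scores track human judgments of faithfulness is exactly what the referee's second comment asks the authors to verify.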

Circularity Check

0 steps flagged

Independent decision-tree scaffolds generated prior to LLM involvement keep derivation self-contained

full rationale

The paper first fits a decision-tree model to extract instance-level decision paths as symbolic scaffolds from the tabular data. These scaffolds are then used to prompt an LLM for natural-language reasoning generation, followed by fine-tuning and evaluation via hallucination rate, necessity, and sufficiency metrics. Because the tree-derived scaffolds are produced by an independent symbolic model before any LLM step and the metrics operate on the generated outputs rather than re-using the same fitted parameters, no load-bearing step reduces by construction to its own inputs. The central claims rest on experimental comparisons rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that decision paths extracted from a trained tree constitute a complete and faithful symbolic representation of the predictive logic that an LLM can be made to follow without semantic loss.

axioms (1)
  • domain assumption: A decision-tree model trained on tabular data yields instance-level decision paths that accurately encode the logic used to reach each prediction.
    This premise is invoked when the paths are extracted and supplied to the LLM as scaffolds.

pith-pipeline@v0.9.0 · 5752 in / 1419 out tokens · 41624 ms · 2026-05-21T00:55:36.605000+00:00 · methodology
