arxiv: 2606.05402 · v1 · pith:ZG3WQKCWnew · submitted 2026-06-03 · 💻 cs.CL · cs.AI

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Pith reviewed 2026-06-28 06:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords reasoning traces · discourse structures · directed acyclic graphs · large reasoning models · error analysis · self-correction · trace monitoring · LLM reasoning

The pith

Large reasoning models from different bases produce structurally similar reasoning traces when mapped to discourse DAGs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReasoningFlow as a way to convert the non-linear reasoning traces of large reasoning models into directed acyclic graphs that encode discourse relations such as backtracking, self-correction, verification, and assumptions. After validating an annotation schema on 31 traces, the method is scaled to 1,260 traces across math, science, and argumentation tasks and five models. The resulting graphs reveal that traces look alike across models, that most erroneous steps are never used in the final answer, and that causal dependencies among steps do not match the language-level discourse relations. These observations matter because they supply a concrete representation for inspecting and potentially intervening in the internal steps of models whose outputs are otherwise hard to audit.

Core claim

ReasoningFlow turns LRM reasoning traces into fine-grained DAGs of discourse structures. The graphs show that traces remain structurally similar across models trained from different bases and post-training data. They expose fine-grained behaviors such as local verification and self-reflection. Most erroneous steps do not contribute to the final answer. Mechanistic causal links between steps diverge from the language-level discourse structure captured in the graphs.

What carries the argument

ReasoningFlow, the framework that converts reasoning traces into directed acyclic graphs whose labeled nodes are individual reasoning steps and whose labeled edges encode the discourse relations among them.
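As a rough illustration of the representation described above, the following minimal sketch stores steps as labeled nodes and discourse relations as labeled directed edges, and checks acyclicity with Kahn's algorithm. The class name `ReasoningGraph`, its methods, and the label strings are hypothetical and chosen for illustration; they are not the paper's schema or its released code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReasoningGraph:
    """Hypothetical container: nodes are reasoning steps, edges are labeled relations."""

    node_labels: dict[int, str] = field(default_factory=dict)
    # Each edge is (source step, relation label, target step).
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_step(self, step_id: int, label: str) -> None:
        self.node_labels[step_id] = label

    def add_relation(self, source: int, relation: str, target: int) -> None:
        if source not in self.node_labels or target not in self.node_labels:
            raise KeyError("both endpoints must be added as steps first")
        self.edges.append((source, relation, target))

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: a graph is a DAG iff all nodes can be topologically ordered."""
        indegree = {n: 0 for n in self.node_labels}
        children = {n: [] for n in self.node_labels}
        for source, _, target in self.edges:
            indegree[target] += 1
            children[source].append(target)
        queue = deque(n for n, d in indegree.items() if d == 0)
        visited = 0
        while queue:
            node = queue.popleft()
            visited += 1
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
        return visited == len(self.node_labels)


# Illustrative labels only (assumed, not taken from the annotation guide).
graph = ReasoningGraph()
graph.add_step(0, "restatement")
graph.add_step(1, "calculation")
graph.add_step(2, "verification")
graph.add_relation(0, "premise", 1)
graph.add_relation(1, "verified_by", 2)
print(graph.is_acyclic())
```

Keeping edges as plain triples makes it straightforward to count label patterns later without a graph library.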

If this is right

  • Structurally similar traces across models imply that post-training converges on common discourse patterns even when base models and data differ.
  • The finding that most erroneous steps are unused suggests that monitoring only the steps that reach the answer could improve trace evaluation.
  • The mismatch between causal step dependencies and discourse structure indicates that language-level analysis and mechanistic analysis must be treated as separate layers.
  • Diverse fine-grained behaviors visible in the graphs supply new categories for automated reasoning-trace monitors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the graphs truly isolate contributory steps, downstream systems could prune non-contributory branches at inference time to reduce compute.
  • The similarity result raises the question of whether the same discourse patterns appear in non-reasoning generative tasks once the same annotation is applied.
  • The released dataset of 1,260 annotated traces could serve as training data for models that predict or correct discourse structure rather than only final answers.

Load-bearing premise

The annotation schema, checked on only 31 manually labeled traces, still captures the same discourse relations when applied automatically to the full set of 1,260 traces.

What would settle it

An additional large reasoning model whose automatically produced ReasoningFlow graphs exhibit markedly different distributions of node types, edge patterns, or error usage from the five models already studied.
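One way to make "markedly different distributions" concrete is to compare label distributions with a divergence measure. The sketch below uses Jensen-Shannon divergence in bits; this metric is an assumption made here for illustration (the figure captions indicate the paper itself uses a PCA plot of triplet distributions), and the function names are hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Turn a list of labels (e.g. node types in one model's graphs) into probabilities."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return {label: count / total for label, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(dist):
        return sum(v * math.log2(v / mixture[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Made-up label lists for two hypothetical models.
model_a = label_distribution(["calculation", "calculation", "verification", "assumption"])
model_b = label_distribution(["calculation", "verification", "verification", "assumption"])
print(js_divergence(model_a, model_b))
```

A new model whose divergence from each of the five studied models is far larger than the divergences among those five would be the kind of counterexample this section describes.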

Figures

Figures reproduced from arXiv: 2606.05402 by Amruta Parulekar, Dilek Hakkani-Tur, Jinu Lee, Julia Hockenmaier, Shivam Agarwal, Siddarth Madala.

Figure 1: Example of a ReasoningFlow graph. Reason…
Figure 2: (a) Principal Component Analysis (PCA) plot of triplet probability distribution, showing clusters of…
Figure 3: Average number of nodes and edges per (model, dataset). While the graph size varies, the average degree (edge/node) remains in 1.3-2.0 (gray region).
  Adjacent text from Section 5.2 (Comparison between models/domains): Finding 1: Different LRMs exhibit similar ReasoningFlow structures (triplet distributions). To investigate how reasoning structures vary across models and domains, we compare the distribution of (node label, edge label, …
Figure 4: Examples of three fine-grained reasoning behaviors (local verification, self-reflection, and assumption).
Figure 5: (a) Local verification happens more frequently…
Figure 6: Self-reflection sentiment correlates with node…
Figure 7: Examples of three types of error handling in LRMs (unused, neglected, and faithful error propagation).
Figure 8: Precision-Recall curve of using Thought An…
Figure 9: Confusion matrix (row-normalized) of Gemini-3.1-Pro annotations compared to human annotations.
Figure 10: Example of a misalignment between human and LLM segmentation. Even if segmentations are…
Figure 11: P-R curve of using Thought Anchors’ scores…
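The Figure 3 text mentions two graph statistics: the average degree (edges per node) and a distribution over label triplets. A minimal sketch of both follows. The third element of the triplet is truncated in the source, so treating it as the target node's label is an assumption; the function names and example labels are also hypothetical.

```python
from collections import Counter


def triplet_distribution(node_labels, edges):
    """Normalized counts of (source node label, edge label, target node label).

    Assumption: the triplet's third element is the target node's label.
    """
    counts = Counter((node_labels[s], rel, node_labels[t]) for s, rel, t in edges)
    total = sum(counts.values())
    return {triplet: c / total for triplet, c in counts.items()} if total else {}


def average_degree(node_labels, edges):
    """Edges per node, matching the caption's "edge/node" ratio."""
    return len(edges) / len(node_labels) if node_labels else 0.0


# Toy graph with made-up labels.
node_labels = {0: "fact", 1: "calculation", 2: "calculation", 3: "conclusion"}
edges = [
    (0, "premise", 1),
    (0, "premise", 2),
    (1, "premise", 3),
    (2, "premise", 3),
    (1, "verified_by", 2),
    (0, "premise", 3),
]
print(average_degree(node_labels, edges))
print(triplet_distribution(node_labels, edges))
```

Each trace (or each model and dataset pair) then yields one probability vector over triplets, which is the kind of input a PCA plot such as the one in Figure 2 would need.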
Original abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReasoningFlow, a framework that converts LRM reasoning traces into fine-grained DAGs encoding discourse relations such as backtracking, self-correction, local verification, and assumptions. The authors manually annotate 31 traces (2.1k steps) achieving high inter-annotator agreement, then apply an automatic annotator to scale to 1,260 traces (247.7k steps) across math, science, and argumentation tasks and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). From the resulting graphs they report four findings: (1) structurally similar traces across models despite different training, (2) diverse fine-grained behaviors useful for monitorability, (3) most erroneous steps are not used to derive final answers, and (4) mechanistic causal dependencies diverge from language-level discourse structure. Dataset and code are released.

Significance. If the automatic annotations faithfully capture the intended discourse relations, the framework supplies a concrete, graph-based representation that could improve reasoning-trace monitoring and error analysis in LRMs. The public release of the 1,260-trace dataset and annotation code is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Automatic annotation procedure (methods section following the manual annotation description)] The annotation schema is validated only via high IAA on the 31 manually labeled traces (2.1k steps); no held-out quantitative evaluation (precision, recall, or F1), error analysis on a larger sample, or description of how the automatic annotator was trained or prompted is supplied. This directly undermines the reliability of the four findings, all of which rest on the automatically produced DAGs for the full 1,260 traces.
  2. [Results section reporting finding (3)] Finding (3) states that 'most of the erroneous steps are not used to derive final answers,' yet the manuscript provides no explicit definition or operationalization of how erroneous steps are identified or how 'used to derive' is determined within the DAGs. Without these details the statistic cannot be independently verified.
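The held-out evaluation requested in major comment 1 could look like the sketch below, which scores automatic labels against manual ones per label. It assumes the two annotations are already aligned step by step, which Figure 10 suggests is not always the case when segmentations differ. The function name and labels are hypothetical.

```python
def per_label_prf(gold, predicted):
    """Per-label (precision, recall, F1) for two aligned label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("sequences must be aligned one-to-one")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up labels: human annotation vs. automatic annotation of four steps.
human = ["verification", "verification", "assumption", "assumption"]
automatic = ["verification", "assumption", "assumption", "assumption"]
for label, (p, r, f1) in per_label_prf(human, automatic).items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```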
minor comments (2)
  1. [Abstract] The abstract would be strengthened by reporting the actual IAA value and at least one key graph statistic (e.g., average number of backtracking edges per trace).
  2. [Methods] Clarify whether the automatic annotator was applied uniformly across all five models or whether model-specific prompting was used; any differences would affect the cross-model similarity claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important areas for clarification. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: The annotation schema is validated only via high IAA on the 31 manually labeled traces (2.1k steps); no held-out quantitative evaluation (precision, recall, or F1), error analysis on a larger sample, or description of how the automatic annotator was trained or prompted is supplied. This directly undermines the reliability of the four findings, all of which rest on the automatically produced DAGs for the full 1,260 traces.

    Authors: We agree that the current manuscript lacks sufficient detail on the automatic annotation procedure. In the revised version, we will add a dedicated subsection describing the prompting strategy for the automatic annotator (including the exact prompts and few-shot examples used), along with a quantitative evaluation on a held-out sample of 50 traces. This evaluation will report precision, recall, and F1 scores by comparing automatic annotations to additional manual labels. We will also include an error analysis discussing common failure modes. These additions will directly support the reliability of the reported findings.
    Revision: yes

  2. Referee: Finding (3) states that 'most of the erroneous steps are not used to derive final answers,' yet the manuscript provides no explicit definition or operationalization of how erroneous steps are identified or how 'used to derive' is determined within the DAGs. Without these details the statistic cannot be independently verified.

    Authors: We acknowledge the need for explicit operational definitions. In the revision, we will define 'erroneous steps' as those involving incorrect calculations, factual errors, or invalid inferences, as flagged by discourse relations such as self-correction or assumption nodes and verified against task ground truth where applicable. 'Used to derive the final answer' will be operationalized as the existence of a directed path in the ReasoningFlow DAG from the erroneous step to the final answer node. We will include this definition in the results section, along with illustrative examples from the dataset to enable independent verification.
    Revision: yes
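The operationalization proposed in response 2 (a step is "used" if a directed path leads from it to the final answer node) can be sketched as a reachability check. This is an illustration of that proposed definition only, not the paper's code. It assumes edges point from a step to the steps that build on it, and the function names are hypothetical.

```python
def reaches(edges, start, goal):
    """True if a directed path start -> ... -> goal exists (iterative depth-first search)."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return False


def unused_error_fraction(edges, erroneous_steps, final_step):
    """Share of erroneous steps with no directed path to the final answer node."""
    if not erroneous_steps:
        return 0.0
    unused = [s for s in erroneous_steps if not reaches(edges, s, final_step)]
    return len(unused) / len(erroneous_steps)


# Toy trace: steps 2 and 3 form an abandoned branch; step 1 feeds the answer (step 4).
edges = [(0, 1), (1, 4), (2, 3)]
print(unused_error_fraction(edges, erroneous_steps=[1, 2, 3], final_step=4))
```

Under this definition, Finding (3) corresponds to the fraction being above one half when aggregated over the annotated traces.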

Circularity Check

0 steps flagged

No significant circularity; empirical annotation and statistics

Full rationale

The paper introduces an annotation schema for discourse structures in LRM traces, validates it manually on 31 traces with reported IAA, then applies automatic annotation to produce DAGs for 1,260 traces and computes graph statistics. No equations, derivations, fitted parameters presented as predictions, or self-citation chains are used to support the central claims; the reported structural similarities and error-usage patterns are direct observations from the constructed graphs rather than reductions to inputs by construction. The claims rest on observable patterns in the annotated data, not on tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on the validity of a new annotation schema for discourse structures; no numerical free parameters are introduced, and the main modeling choice is the DAG representation itself.

axioms (1)
  • Domain assumption: Reasoning traces contain identifiable discourse structures that can be consistently labeled as nodes and directed edges in a DAG.
    This premise is required for the annotation schema and is justified only by the reported inter-annotator agreement on 31 traces.
invented entities (1)
  • ReasoningFlow DAG (no independent evidence)
    Purpose: To represent fine-grained discourse relations in LRM reasoning traces.
    New representational object introduced by the paper; no independent evidence outside the annotation process is provided.


