pith. machine review for the scientific record.

arxiv: 2605.11169 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Nikki Lijing Kuang, Sheldon Yu, Sizhe Zhou, Tong Yu, Xintong Li

Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · ReAct · online adaptation · contextual bandits · inference-time learning · action selection · decision making

The pith

OLIVIA models the action-selection step of ReAct LLM agents as a contextual linear bandit over frozen hidden states to enable direct online updates from feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReAct agents interleave reasoning and tool calls but can accumulate small action errors across repeated tasks in deployment. Existing fixes rely on prompting or retrieval that leave no explicit, updatable decision layer. OLIVIA instead treats the final action choice as a contextual linear bandit whose contexts are the LLM's frozen hidden states; upper-confidence-bound selection then drives lightweight, uncertainty-aware updates after each action outcome. The approach leaves the underlying reasoning process untouched while supplying trackable adaptation and explicit uncertainty estimates. Experiments on four benchmarks show consistent gains over both static ReAct and prompt-based inference-time baselines.

Core claim

By representing the LLM's final action-selection layer as a contextual linear bandit whose contexts are frozen hidden states, OLIVIA supplies an explicit decision interface that admits online linear updates from action-level feedback and upper-confidence-bound exploration while preserving the original reasoning trace.

What carries the argument

A contextual linear bandit whose contexts are the LLM's frozen hidden states and whose arms are candidate actions, updated online with upper-confidence-bound selection.
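
For concreteness, a minimal sketch of that machinery, assuming a standard disjoint LinUCB construction (per-action ridge statistics over a shared frozen context); class names, dimensions, and the exploration coefficient are illustrative, not taken from the paper:

```python
import numpy as np

class LinUCBActionHead:
    """Per-action linear bandit over a frozen hidden-state context.

    Sketch of the mechanism described above: each candidate action keeps
    its own ridge-regression statistics (A, b); selection uses an upper
    confidence bound; feedback triggers a rank-1 online update. The LLM
    producing the context is never touched.
    """

    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha                                      # exploration coefficient (assumed value)
        self.A = [lam * np.eye(dim) for _ in range(n_actions)]  # lam*I + sum x x^T, per action
        self.b = [np.zeros(dim) for _ in range(n_actions)]      # sum r*x, per action

    def select(self, x: np.ndarray) -> int:
        """Score every candidate action as mean + alpha * stddev, pick the argmax."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                  # ridge estimate of the action's value weights
            bonus = np.sqrt(x @ A_inv @ x)     # uncertainty of this context for this action
            scores.append(float(theta @ x) + self.alpha * float(bonus))
        return int(np.argmax(scores))

    def update(self, a: int, x: np.ndarray, reward: float) -> None:
        """Online update from action-level feedback; only bandit parameters change."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```

In use, x would be the frozen last-layer hidden state extracted at the action-selection step of a ReAct episode, and reward an action-level success signal; how the paper actually extracts the state and defines the reward is questioned in the referee report below.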

If this is right

  • Task performance improves consistently over static ReAct and prompt-based baselines on four benchmarks.
  • Policy updates occur sample-efficiently with only lightweight linear updates and no retraining of the underlying LLM.
  • Adaptation remains trackable because each update is an explicit change to the bandit parameters rather than opaque prompt edits.
  • Uncertainty estimates are available at every action choice because the bandit maintains explicit variance terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-state bandit construction could be attached to other agent scaffolds that expose an action-selection interface.
  • If hidden states prove linearly insufficient on harder domains, the framework would naturally motivate richer context features or nonlinear bandits.
  • Repeated deployment on related tasks could produce a growing library of bandit parameters that transfer across similar subtasks without prompt engineering.

Load-bearing premise

The frozen hidden states contain enough information for a linear model to recover action values and uncertainty without substantial loss from the preceding reasoning steps.
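
One way to write this premise down, in reviewer notation rather than anything the paper fixes: with $x_t$ the frozen last-layer hidden state at step $t$ and $a_t$ the chosen action,

```latex
% Hedged restatement of the linear-realizability premise (reviewer notation):
% expected action-level reward is (approximately) linear in the frozen context.
\mathbb{E}\!\left[ r_t \mid x_t, a_t \right] \approx \langle \theta^{*}_{a_t},\, x_t \rangle,
\qquad
r_t = \langle \theta^{*}_{a_t},\, x_t \rangle + \varepsilon_t,
\quad \varepsilon_t \ \text{conditionally mean-zero}.
```

If the hidden state drops information produced by earlier reasoning steps, or the state-to-reward map is materially nonlinear, then $\theta^{*}$ is misspecified and the UCB bonus no longer tracks true uncertainty; the referee report's first major comment presses exactly this point.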

What would settle it

On any of the four benchmarks, OLIVIA produces success rates or completion times statistically indistinguishable from or worse than the static ReAct baseline when the same prompts and LLM are used.

Figures

Figures reproduced from arXiv: 2605.11169 by Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Nikki Lijing Kuang, Sheldon Yu, Sizhe Zhou, Tong Yu, Xintong Li.

Figure 1. Overview of OLIVIA. Left: A frozen LLM backbone processes the task and trajectory prefix through its transformer layers, producing hidden states and a reasoning trace (Thought). Right: At each action-selection step, OLIVIA extracts the last-layer hidden state as the decision context and scores candidate tools using per-action UCB estimates.

Figure 2. Running-average F1 over the episode stream on the four benchmarks. Static …

Figure 3. Synthetic experiments. Left: cumulative regret over rounds. Right: parameter esti…

Figure 4. Qwen3-4B on ToolBench. [Line plot: Value vs. Step (0–500); curves for BM25, C2S, CoT, ReAct, CLIn; panel labeled "Qwen / Taskbench".]

Figure 6. Qwen3-4B on TaskBench-MM. [Line plot: Value vs. Step (0–500); curves for BM25, C2S, CoT, ReAct, CLIn; panel labeled "Qwen / Bfcl".]

Figure 8. Mistral-7B on ToolBench. [Line plot: Value vs. Step (0–500); curves for BM25, C2S, CoT, ReAct, CLIn; panel labeled "Mistral / Taskbench".]

Figure 10. Mistral-7B on TaskBench-MM. [Line plot: Value vs. Step (0–500); curves for BM25, C2S, CoT, ReAct, CLIn; panel labeled "Mistral / Bfcl".]

Figure 12. Reward Studies.
Original abstract

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes OLIVIA, an inference-time adaptation method for ReAct-style LLM agents. It treats the final hidden state before action selection as the context vector for a contextual linear bandit over candidate actions, maintains per-action parameters, and applies UCB exploration with online updates from action-level feedback. The central claim is that this yields consistent performance gains on four benchmarks relative to static ReAct and prompt-based inference-time baselines while preserving the original reasoning process and adding explicit uncertainty estimates at low computational cost.

Significance. If the linear-bandit construction on frozen hidden states proves reliable, the work supplies a lightweight, uncertainty-aware, and directly updatable decision layer for deployed LLM agents. This is a meaningful alternative to purely prompt- or retrieval-based adaptation, especially for repeated multi-step tasks where small action errors accumulate. The approach is notable for its explicit modeling of action values and uncertainty rather than indirect context manipulation.

major comments (2)
  1. [Method description (abstract and §3)] The headline performance claim rests on the assumption that the LLM's final hidden state before action selection encodes all task-relevant information produced by preceding ReAct reasoning steps and that the mapping from this state to expected reward is approximately linear. Neither property is guaranteed: reasoning information may be distributed across the token sequence rather than concentrated in the last token, and LLM representations often require non-linear probes for downstream quantities. If either fails, the bandit receives noisy or biased targets, UCB exploration becomes ineffective, and observed gains could be explained by prompt sensitivity or benchmark idiosyncrasies rather than the adaptation mechanism.
  2. [Experiments (§5)] The experimental section asserts consistent improvements on four benchmarks but supplies no statistical significance tests, ablation results isolating the bandit component, details on hidden-state extraction or dimensionality, or implementation specifics for the UCB updates and reward signals. Without these, it is impossible to verify that gains are attributable to the proposed online decision layer rather than other factors.
minor comments (2)
  1. [§3] Notation for the context vector x_t and per-action parameters should be introduced with explicit equations rather than prose description to improve reproducibility (one plausible formalization is sketched after this list).
  2. [§4] The paper would benefit from a short discussion of how action-level feedback is obtained in each benchmark (e.g., success/failure signals or reward definitions).
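
In the spirit of minor comment 1, one plausible formalization in standard disjoint-LinUCB notation; this is a reviewer sketch, not the paper's own equations:

```latex
% Assumed LinUCB-style equations (reviewer sketch, not confirmed by the paper).
% Per-action ridge statistics over steps s at which action a was taken:
A_a = \lambda I + \sum_{s \le t:\, a_s = a} x_s x_s^{\top}, \qquad
b_a = \sum_{s \le t:\, a_s = a} r_s x_s, \qquad
\hat{\theta}_a = A_a^{-1} b_a.
% UCB selection at the next step, with exploration coefficient \alpha:
a_{t+1} = \arg\max_a \Big( \hat{\theta}_a^{\top} x_{t+1}
        + \alpha \sqrt{\, x_{t+1}^{\top} A_a^{-1} x_{t+1} \,} \Big).
```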

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.

Point-by-point responses
  1. Referee: [Method description (abstract and §3)] The headline performance claim rests on the assumption that the LLM's final hidden state before action selection encodes all task-relevant information produced by preceding ReAct reasoning steps and that the mapping from this state to expected reward is approximately linear. Neither property is guaranteed: reasoning information may be distributed across the token sequence rather than concentrated in the last token, and LLM representations often require non-linear probes for downstream quantities. If either fails, the bandit receives noisy or biased targets, UCB exploration becomes ineffective, and observed gains could be explained by prompt sensitivity or benchmark idiosyncrasies rather than the adaptation mechanism.

    Authors: We acknowledge the validity of this concern regarding the assumptions underlying the contextual linear bandit construction. While the final hidden state is not theoretically guaranteed to concentrate all task-relevant information (as reasoning steps may distribute information across tokens), the ReAct paradigm explicitly generates and interleaves reasoning traces immediately prior to action selection, making the last hidden state a natural and commonly used context representation in LLM decision-making literature. We have revised Section 3 to include an expanded discussion of this design choice, its motivations, and potential limitations, along with references to related work employing last-token representations. On linearity, we do not assert it holds universally but note that the linear model enables lightweight, uncertainty-aware updates suitable for deployment; the consistent empirical gains over strong baselines (including prompt-based methods that control for indirect context effects) suggest the approximation is effective in practice. We have also added further controls in the experiments (see response to the second comment) to help attribute gains to the adaptation mechanism rather than idiosyncrasies. revision: partial

  2. Referee: [Experiments (§5)] The experimental section asserts consistent improvements on four benchmarks but supplies no statistical significance tests, ablation results isolating the bandit component, details on hidden-state extraction or dimensionality, or implementation specifics for the UCB updates and reward signals. Without these, it is impossible to verify that gains are attributable to the proposed online decision layer rather than other factors.

    Authors: We agree that the original experimental section was insufficiently detailed to allow full verification of the source of the reported gains. In the revised manuscript, we have substantially expanded Section 5 and the appendix with the following: statistical significance testing via paired t-tests and bootstrap confidence intervals across multiple random seeds, with p-values reported for all comparisons; ablation studies that isolate the bandit components (e.g., static linear model without UCB exploration, non-updating parameters, and random action selection); explicit details on hidden-state extraction (last token of the final reasoning step prior to action selection, using the model's native hidden dimension such as 4096 for the evaluated LLMs); and full implementation specifics for UCB (exploration coefficient α = 1.0, online ridge-regression updates, and binary reward signals derived from action-level success or task completion). These additions directly address the referee's points and enable readers to confirm that improvements arise from the online decision layer. revision: yes
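
Taking the rebuttal's stated specifics at face value (α = 1.0, binary action-level reward, online ridge-regression updates), the per-step update can be kept at O(d²) by maintaining the inverse directly via the Sherman–Morrison identity; a hedged sketch, with the function name and in-place convention our own:

```python
import numpy as np

def sherman_morrison_update(A_inv: np.ndarray, b: np.ndarray,
                            x: np.ndarray, reward: float) -> None:
    """O(d^2) online ridge update for one action's bandit statistics.

    Maintains A_inv = (lam*I + sum x x^T)^{-1} in place, so no matrix
    inversion is needed at selection time. A sketch of the "lightweight
    online updates" the rebuttal describes; the paper's exact update
    rule is not spelled out on this page.
    """
    Ax = A_inv @ x
    A_inv -= np.outer(Ax, Ax) / (1.0 + float(x @ Ax))  # rank-1 downdate of the inverse
    b += reward * x                                     # binary reward in {0, 1} per the rebuttal
```

Whether this is the implementation the authors used is not verifiable from the page; it is the standard way to make such updates cheap enough for deployment.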

Circularity Check

0 steps flagged

No significant circularity; standard contextual bandit applied to LLM states with empirical validation

Full rationale

The paper introduces OLIVIA by directly defining the action-selection layer as a contextual linear bandit using frozen hidden states as contexts and applying UCB for online updates from action feedback. This is an application of an existing algorithm (contextual bandits) to new inputs (LLM hidden states), not a derivation that reduces claimed improvements to fitted parameters or self-referential definitions by construction. Performance gains are shown via experiments on four benchmarks against baselines, with no load-bearing self-citations, uniqueness theorems, or ansatzes that collapse the result to its inputs. The chain remains self-contained as a proposed framework plus empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLM hidden states serve as adequate linear contexts for action values and on the standard UCB algorithm; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Frozen LLM hidden states can be treated as contexts for a linear model of action values.
    This is the modeling choice that allows the bandit formulation at the action-selection interface.

pith-pipeline@v0.9.0 · 5596 in / 1243 out tokens · 61508 ms · 2026-05-13T02:20:03.646401+00:00 · methodology

