arxiv: 2605.11169 · v1 · submitted 2026-05-11 · 💻 cs.AI


OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Nikki Lijing Kuang, Sheldon Yu, Sizhe Zhou, Tong Yu, Xintong Li

Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3


keywords: LLM agents · ReAct · online adaptation · contextual bandits · inference-time learning · action selection · decision making


The pith

OLIVIA models the action-selection step of ReAct LLM agents as a contextual linear bandit over frozen hidden states to enable direct online updates from feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReAct agents interleave reasoning and tool calls but can accumulate small action errors across repeated tasks in deployment. Existing fixes rely on prompting or retrieval that leave no explicit, updatable decision layer. OLIVIA instead treats the final action choice as a contextual linear bandit whose contexts are the LLM's frozen hidden states; upper-confidence-bound selection then drives lightweight, uncertainty-aware updates after each action outcome. The approach leaves the underlying reasoning process untouched while supplying trackable adaptation and explicit uncertainty estimates. Experiments on four benchmarks show consistent gains over both static ReAct and prompt-based inference-time baselines.

Core claim

By representing the LLM's final action-selection layer as a contextual linear bandit whose contexts are frozen hidden states, OLIVIA supplies an explicit decision interface that admits online linear updates from action-level feedback and upper-confidence-bound exploration while preserving the original reasoning trace.

What carries the argument

contextual linear bandit whose contexts are the LLM's frozen hidden states and whose arms are candidate actions, updated online with upper-confidence-bound selection
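The construction named above has the shape of the textbook disjoint LinUCB algorithm. A minimal sketch follows, assuming one ridge-regression model per candidate action. The class name `DisjointLinUCB`, the regularizer `lam`, and the synthetic contexts that stand in for frozen hidden states are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


class DisjointLinUCB:
    """Per-action ridge regression with an upper-confidence-bound bonus."""

    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha  # exploration coefficient
        # A[a] = lam * I + sum of x x^T over rounds where action a was played.
        self.A = np.stack([lam * np.eye(dim) for _ in range(n_actions)])
        # b[a] = sum of reward * x over rounds where action a was played.
        self.b = np.zeros((n_actions, dim))

    def scores(self, x: np.ndarray) -> np.ndarray:
        """Return estimated value plus exploration bonus for every action."""
        out = np.empty(len(self.A))
        for a, (A_a, b_a) in enumerate(zip(self.A, self.b)):
            theta = np.linalg.solve(A_a, b_a)              # ridge estimate
            width = np.sqrt(x @ np.linalg.solve(A_a, x))   # uncertainty term
            out[a] = theta @ x + self.alpha * width
        return out

    def select(self, x: np.ndarray) -> int:
        return int(np.argmax(self.scores(x)))

    def update(self, action: int, x: np.ndarray, reward: float) -> None:
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x


# Synthetic stand-in for the deployment loop: random unit vectors play the
# role of hidden-state contexts, and rewards are linear plus noise.
rng = np.random.default_rng(0)
dim, n_actions = 8, 3
true_theta = rng.normal(size=(n_actions, dim))  # hypothetical ground truth
agent = DisjointLinUCB(n_actions, dim)
hits = []
for _ in range(2000):
    x = rng.normal(size=dim)
    x /= np.linalg.norm(x)
    a = agent.select(x)
    reward = true_theta[a] @ x + 0.1 * rng.normal()
    agent.update(a, x, reward)
    hits.append(a == int(np.argmax(true_theta @ x)))
late_accuracy = float(np.mean(hits[-500:]))
print(f"share of optimal choices in the last 500 rounds: {late_accuracy:.2f}")
```

In an actual agent the context `x` would be the extracted hidden state and the reward would come from the environment; both are replaced by synthetic draws here so the loop runs on its own.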

If this is right

Task performance improves consistently over static ReAct and prompt-based baselines on four benchmarks.
Policy updates occur sample-efficiently with only lightweight linear updates and no retraining of the underlying LLM.
Adaptation remains trackable because each update is an explicit change to the bandit parameters rather than opaque prompt edits.
Uncertainty estimates are available at every action choice because the bandit maintains explicit variance terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same frozen-state bandit construction could be attached to other agent scaffolds that expose an action-selection interface.
If hidden states prove linearly insufficient on harder domains, the framework would naturally motivate richer context features or nonlinear bandits.
Repeated deployment on related tasks could produce a growing library of bandit parameters that transfer across similar subtasks without prompt engineering.

Load-bearing premise

The frozen hidden states contain enough information for a linear model to recover action values and uncertainty without substantial loss from the preceding reasoning steps.
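One way to probe this premise is to fit a linear model on logged (context, reward) pairs and check held-out fit. The sketch below is an assumption-laden illustration on synthetic data: the helper `ridge_probe_r2` and both reward functions are hypothetical, and real hidden states would replace the random matrix `X`.

```python
import numpy as np


def ridge_probe_r2(X_train, y_train, X_test, y_test, lam=1.0):
    """Fit a ridge probe on the training split and return held-out R^2."""
    d = X_train.shape[1]
    theta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)
    resid = y_test - X_test @ theta
    total = np.sum((y_test - y_test.mean()) ** 2)
    return float(1.0 - (resid @ resid) / total)


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))          # stand-in for hidden-state contexts
w = rng.normal(size=32)

# Case 1: the reward is linear in the context, so the premise holds.
y_linear = X @ w + 0.1 * rng.normal(size=2000)
# Case 2: the reward depends on an interaction a linear probe cannot see.
y_product = X[:, 0] * X[:, 1]

r2_linear = ridge_probe_r2(X[:1500], y_linear[:1500], X[1500:], y_linear[1500:])
r2_product = ridge_probe_r2(X[:1500], y_product[:1500], X[1500:], y_product[1500:])
print(f"held-out R^2, linear reward:  {r2_linear:.3f}")
print(f"held-out R^2, product reward: {r2_product:.3f}")
```

A held-out fit near zero on real contexts would suggest the linear model is discarding information, which is the failure mode the premise is exposed to.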

What would settle it

On any of the four benchmarks, OLIVIA produces success rates or completion times statistically indistinguishable from or worse than the static ReAct baseline when the same prompts and LLM are used.
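A check of this kind could be run as a paired bootstrap on per-episode outcomes. The sketch below assumes binary success indicators collected for the same episodes under both systems; the function name `paired_bootstrap_ci` and the synthetic outcomes are illustrative, not the paper's protocol or data.

```python
import numpy as np


def paired_bootstrap_ci(baseline, adapted, n_boot=5000, level=0.95, seed=0):
    """Return (mean difference, lower, upper) for adapted minus baseline."""
    baseline = np.asarray(baseline, dtype=float)
    adapted = np.asarray(adapted, dtype=float)
    if baseline.shape != adapted.shape:
        raise ValueError("paired samples must have the same shape")
    diff = adapted - baseline
    rng = np.random.default_rng(seed)
    # Resample episode indices with replacement, keeping pairs intact.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return float(diff.mean()), float(lo), float(hi)


# Synthetic per-episode successes, for illustration only.
rng = np.random.default_rng(1)
static_success = rng.random(300) < 0.50
adapted_success = rng.random(300) < 0.65
mean_diff, lo, hi = paired_bootstrap_ci(static_success, adapted_success)
print(f"difference in success rate: {mean_diff:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

An interval that includes zero, or sits below it, on any benchmark would correspond to the settling outcome described above.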

Figures

Figures reproduced from arXiv: 2605.11169 by Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Nikki Lijing Kuang, Sheldon Yu, Sizhe Zhou, Tong Yu, Xintong Li.

**Figure 1.** Overview of OLIVIA. Left: A frozen LLM backbone processes the task and trajectory prefix through its transformer layers, producing hidden states and a reasoning trace (Thought). Right: At each action-selection step, OLIVIA extracts the last-layer hidden state as the decision context and scores candidate tools using per-action UCB estimates.

**Figure 2.** Running-average F1 over the episode stream on the four benchmarks.

**Figure 3.** Synthetic experiments. Left: cumulative regret over rounds. Right: parameter esti…

**Figure 4.** Qwen3-4B on ToolBench. Panel title: Qwen / Taskbench. Axes: Step (x), Value (y). Legend: BM25, C2S, CoT, ReAct, CLIn…

**Figure 6.** Qwen3-4B on TaskBench-MM. Panel title: Qwen / Bfcl. Axes: Step (x), Value (y). Legend: BM25, C2S, CoT, ReAct, CLIn…

**Figure 8.** Mistral-7B on ToolBench. Panel title: Mistral / Taskbench. Axes: Step (x), Value (y). Legend: BM25, C2S, CoT, ReAct, CLIn…

**Figure 10.** Mistral-7B on TaskBench-MM. Panel title: Mistral / Bfcl. Axes: Step (x), Value (y). Legend: BM25, C2S, CoT, ReAct, CLIn…

**Figure 12.** Reward Studies.

read the original abstract

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OLIVIA frames ReAct action selection as a contextual linear bandit on the final frozen hidden state with UCB updates, which delivers reported gains over baselines but hinges on how well that state encodes the prior reasoning.


OLIVIA models the action selection step in ReAct LLM agents as a contextual linear bandit, with the frozen final hidden state serving as the context and UCB driving online updates from action feedback. This framing stands out as a direct way to adapt during deployment. The paper does a solid job of positioning this against prompt-based or retrieval methods, which act more indirectly. By exposing an explicit decision layer, it allows for uncertainty-aware updates that are lightweight and preserve the original reasoning chain. The reported consistent improvements over baselines on four benchmarks suggest this can help reduce errors in repeated sequential tasks.

One soft spot is the core assumption that the last hidden state captures the necessary task information without much loss. Reasoning steps produce a sequence of tokens, and their value might not be fully summarized in the final representation. The linear bandit model then has to work with whatever is left, which could lead to poor uncertainty estimates or ineffective exploration. The abstract does not detail ablations or how they validated the state choice, so it is difficult to rule out that the gains come from implementation details rather than the bandit itself.

This work is for people developing LLM agents for practical, ongoing use cases where online adaptation matters. Readers focused on inference-time techniques or bandit applications to language models would get the most out of it. It deserves a serious referee because the proposal is clear and the empirical claims, if backed by proper experiments in the full paper, would add a useful tool to the area.

Referee Report

2 major / 2 minor

Summary. The paper proposes OLIVIA, an inference-time adaptation method for ReAct-style LLM agents. It treats the final hidden state before action selection as the context vector for a contextual linear bandit over candidate actions, maintains per-action parameters, and applies UCB exploration with online updates from action-level feedback. The central claim is that this yields consistent performance gains on four benchmarks relative to static ReAct and prompt-based inference-time baselines while preserving the original reasoning process and adding explicit uncertainty estimates at low computational cost.

Significance. If the linear-bandit construction on frozen hidden states proves reliable, the work supplies a lightweight, uncertainty-aware, and directly updatable decision layer for deployed LLM agents. This is a meaningful alternative to purely prompt- or retrieval-based adaptation, especially for repeated multi-step tasks where small action errors accumulate. The approach is notable for its explicit modeling of action values and uncertainty rather than indirect context manipulation.

major comments (2)

[Method description (abstract and §3)] The headline performance claim rests on the assumption that the LLM's final hidden state before action selection encodes all task-relevant information produced by preceding ReAct reasoning steps and that the mapping from this state to expected reward is approximately linear. Neither property is guaranteed: reasoning information may be distributed across the token sequence rather than concentrated in the last token, and LLM representations often require non-linear probes for downstream quantities. If either fails, the bandit receives noisy or biased targets, UCB exploration becomes ineffective, and observed gains could be explained by prompt sensitivity or benchmark idiosyncrasies rather than the adaptation mechanism.
[Experiments (§5)] The experimental section asserts consistent improvements on four benchmarks but supplies no statistical significance tests, ablation results isolating the bandit component, details on hidden-state extraction or dimensionality, or implementation specifics for the UCB updates and reward signals. Without these, it is impossible to verify that gains are attributable to the proposed online decision layer rather than other factors.

minor comments (2)

[§3] Notation for the context vector x_t and per-action parameters should be introduced with explicit equations rather than prose description to improve reproducibility.
[§4] The paper would benefit from a short discussion of how action-level feedback is obtained in each benchmark (e.g., success/failure signals or reward definitions).
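For orientation, the kind of explicit notation the first minor comment asks for could look like the textbook disjoint LinUCB equations below. This is a hedged sketch of the standard formulation, not the paper's own notation: the symbols $A_{a,t}$, $b_{a,t}$, $\hat{\theta}_{a,t}$, the regularizer $\lambda$, and the reward $r_s$ are assumptions, while the context $x_t$ and the exploration coefficient $\alpha$ are the only symbols that appear in this review.

```latex
% Per-action sufficient statistics from the rounds s < t in which action a was taken
A_{a,t} = \lambda I + \sum_{s < t,\; a_s = a} x_s x_s^{\top},
\qquad
b_{a,t} = \sum_{s < t,\; a_s = a} r_s \, x_s

% Ridge-regression estimate of the per-action parameter
\hat{\theta}_{a,t} = A_{a,t}^{-1} \, b_{a,t}

% Upper-confidence-bound action selection at step t
a_t = \arg\max_{a} \; \hat{\theta}_{a,t}^{\top} x_t
      + \alpha \sqrt{x_t^{\top} A_{a,t}^{-1} x_t}
```

The square-root term is the explicit uncertainty estimate: it shrinks along directions of $x_t$ that action $a$ has already been tried in.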

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.


Referee: [Method description (abstract and §3)] The headline performance claim rests on the assumption that the LLM's final hidden state before action selection encodes all task-relevant information produced by preceding ReAct reasoning steps and that the mapping from this state to expected reward is approximately linear. Neither property is guaranteed: reasoning information may be distributed across the token sequence rather than concentrated in the last token, and LLM representations often require non-linear probes for downstream quantities. If either fails, the bandit receives noisy or biased targets, UCB exploration becomes ineffective, and observed gains could be explained by prompt sensitivity or benchmark idiosyncrasies rather than the adaptation mechanism.

Authors: We acknowledge the validity of this concern regarding the assumptions underlying the contextual linear bandit construction. While the final hidden state is not theoretically guaranteed to concentrate all task-relevant information (as reasoning steps may distribute information across tokens), the ReAct paradigm explicitly generates and interleaves reasoning traces immediately prior to action selection, making the last hidden state a natural and commonly used context representation in LLM decision-making literature. We have revised Section 3 to include an expanded discussion of this design choice, its motivations, and potential limitations, along with references to related work employing last-token representations. On linearity, we do not assert it holds universally but note that the linear model enables lightweight, uncertainty-aware updates suitable for deployment; the consistent empirical gains over strong baselines (including prompt-based methods that control for indirect context effects) suggest the approximation is effective in practice. We have also added further controls in the experiments (see response to the second comment) to help attribute gains to the adaptation mechanism rather than idiosyncrasies. revision: partial
Referee: [Experiments (§5)] The experimental section asserts consistent improvements on four benchmarks but supplies no statistical significance tests, ablation results isolating the bandit component, details on hidden-state extraction or dimensionality, or implementation specifics for the UCB updates and reward signals. Without these, it is impossible to verify that gains are attributable to the proposed online decision layer rather than other factors.

Authors: We agree that the original experimental section was insufficiently detailed to allow full verification of the source of the reported gains. In the revised manuscript, we have substantially expanded Section 5 and the appendix with the following: statistical significance testing via paired t-tests and bootstrap confidence intervals across multiple random seeds, with p-values reported for all comparisons; ablation studies that isolate the bandit components (e.g., static linear model without UCB exploration, non-updating parameters, and random action selection); explicit details on hidden-state extraction (last token of the final reasoning step prior to action selection, using the model's native hidden dimension such as 4096 for the evaluated LLMs); and full implementation specifics for UCB (exploration coefficient α = 1.0, online ridge-regression updates, and binary reward signals derived from action-level success or task completion). These additions directly address the referee's points and enable readers to confirm that improvements arise from the online decision layer. revision: yes
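The implementation specifics listed in this response (α = 1.0, online ridge regression, binary rewards) are compatible with a rank-one Sherman-Morrison update of the inverse design matrix, which avoids re-solving a linear system at every step. The sketch below assumes that particular update, which the response does not name, and checks it against a batch ridge solution; the dimension, the regularizer `lam`, and the random rewards are illustrative.

```python
import numpy as np


def sherman_morrison_update(A_inv, b, x, reward):
    """Return the inverse of (A + x x^T) and b + reward * x, given A^{-1}.

    Assumes A_inv is symmetric, which holds when A starts as lam * I.
    """
    Ax = A_inv @ x
    A_inv_new = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
    return A_inv_new, b + reward * x


rng = np.random.default_rng(0)
dim, lam, alpha = 16, 1.0, 1.0           # lam and dim are assumptions
A_inv = np.eye(dim) / lam
b = np.zeros(dim)
contexts, rewards = [], []
for _ in range(50):
    x = rng.normal(size=dim)             # stand-in for a hidden-state context
    reward = float(rng.random() < 0.5)   # binary success signal
    A_inv, b = sherman_morrison_update(A_inv, b, x, reward)
    contexts.append(x)
    rewards.append(reward)

X = np.array(contexts)
r = np.array(rewards)
theta_online = A_inv @ b
# Batch ridge regression on the same data should give the same estimate.
theta_batch = np.linalg.solve(lam * np.eye(dim) + X.T @ X, X.T @ r)
print("online and batch estimates agree:", np.allclose(theta_online, theta_batch))

# UCB score for a new context under the current estimate.
x_new = rng.normal(size=dim)
ucb = theta_online @ x_new + alpha * np.sqrt(x_new @ A_inv @ x_new)
```

Each update costs on the order of the squared context dimension, which is the sense in which such updates are lightweight next to any change to the LLM itself.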

Circularity Check

0 steps flagged

No significant circularity; standard contextual bandit applied to LLM states with empirical validation

full rationale

The paper introduces OLIVIA by directly defining the action-selection layer as a contextual linear bandit using frozen hidden states as contexts and applying UCB for online updates from action feedback. This is an application of an existing algorithm (contextual bandits) to new inputs (LLM hidden states), not a derivation that reduces claimed improvements to fitted parameters or self-referential definitions by construction. Performance gains are shown via experiments on four benchmarks against baselines, with no load-bearing self-citations, uniqueness theorems, or ansatzes that collapse the result to its inputs. The chain remains self-contained as a proposed framework plus empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM hidden states serve as adequate linear contexts for action values and on the standard UCB algorithm; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Frozen LLM hidden states can be treated as contexts for a linear model of action values.
This is the modeling choice that allows the bandit formulation at the action-selection interface.


