hub Mixed citations

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen · 2022 · cs.CL · arXiv 2211.12588

Mixed citation behavior. Most common role is background (65%).

59 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 59 citing papers arXiv PDF

abstract

Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step `thought' process. To disentangle computation from reasoning, we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT can show an average performance gain over CoT by around 12\% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released in Github https://github.com/wenhuchen/Program-of-Thoughts

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 16 method 6 baseline 1

citation-polarity summary

background 15 use method 6 baseline 1 support 1

claims ledger

abstract Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step `thought' process. To disentangle computation from reasoning, we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated prog
background sizing a CoT process with macro actions within the rea- soning sequence can significantly improve the data effi- ciency of the reasoning chain. For instance, LLaVA-CoT [229] enhances CoT data synthesis by externalizing in- termediate reasoning steps across multiple modalities. AtomThink [231] generates the AMATH-SFT dataset using a structured g1 prompt [238], achieving supe- rior performance on long-horizon reasoning tasks com- pared to traditional CoT approaches. CoAct [239] intro- duces a dual

co-cited works

representative citing papers

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

cs.MA · 2026-04-13 · unverdicted · novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

cs.SE · 2026-04-06 · conditional · novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

cs.CL · 2026-02-11 · unverdicted · novelty 7.0

LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal degradation.

PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

cs.CL · 2025-12-11 · conditional · novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

cs.AI · 2023-04-22 · accept · novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

cs.SE · 2026-05-28 · unverdicted · novelty 6.0

RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering

stat.ME · 2026-05-20 · unverdicted · novelty 6.0

Proposes nearly balanced TCARDs that minimize the first two generalized word-length pattern components, defines Φ_BCD criterion linked to classical optimality, and constructs designs via coordinate exchange with simulation-calibrated weights for LLM prompt engineering.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

cs.AR · 2026-04-20 · unverdicted · novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

citing papers explorer

Showing 50 of 59 citing papers.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 11 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
PAL: Program-aided Language Models cs.CL · 2022-11-18 · conditional · none · ref 7 · internal anchor
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization cs.CL · 2026-05-20 · unverdicted · none · ref 18 · internal anchor
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation cs.AI · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation cs.CV · 2026-04-29 · accept · none · ref 29 · internal anchor
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems cs.MA · 2026-04-13 · unverdicted · none · ref 32 · internal anchor
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software cs.SE · 2026-04-06 · conditional · none · ref 6 · internal anchor
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation cs.SE · 2026-04-03 · unverdicted · none · ref 7 · internal anchor
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations cs.CL · 2026-02-11 · unverdicted · none · ref 15 · internal anchor
LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal degradation.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data cs.CL · 2025-12-11 · conditional · none · ref 28 · internal anchor
PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 9 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 50 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 71 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency cs.AI · 2023-04-22 · accept · none · ref 63 · internal anchor
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
ViperGPT: Visual Inference via Python Execution for Reasoning cs.CV · 2023-03-14 · unverdicted · none · ref 10 · internal anchor
ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
On Effectiveness and Efficiency of Agentic Tool-calling and RL Training cs.LG · 2026-05-28 · unverdicted · none · ref 42 · internal anchor
Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.
REPOT: Recoverable Program-of-Thought via Checkpoint Repair cs.SE · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.
Unified Data Selection for LLM Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 18 · internal anchor
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering stat.ME · 2026-05-20 · unverdicted · none · ref 56 · internal anchor
Proposes nearly balanced TCARDs that minimize the first two generalized word-length pattern components, defines Φ_BCD criterion linked to classical optimality, and constructs designs via coordinate exchange with simulation-calibrated weights for LLM prompt engineering.
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning cs.CL · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
Weighted Rules under the Stable Model Semantics cs.AI · 2026-05-10 · unverdicted · none · ref 79 · internal anchor
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
PaT: Planning-after-Trial for Efficient Test-Time Code Generation cs.CL · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 126 · internal anchor
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization cs.AR · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates cs.CV · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithmetic, and proposes a Table Router Pipeline combining VLMs with rule-based execution.
ExecTune: Effective Steering of Black-Box LLMs with Guide Models cs.LG · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning cs.CL · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
Generating Verifiable Chain of Thoughts from Exection-Traces cs.SE · 2025-11-28 · unverdicted · none · ref 5 · internal anchor
A pipeline produces 54,000 execution-trace-verified bi-directional Chain-of-Thought rationales for code, and fine-tuning on them yields gains up to 26.6 points on LiveCodeBench-Exec and similar benchmarks.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 159 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
ToolRL: Reward is All Tool Learning Needs cs.LG · 2025-04-16 · conditional · none · ref 3 · internal anchor
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs cs.CL · 2025-04-15 · unverdicted · none · ref 1 · internal anchor
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
Scaling Synthetic Data Creation with 1,000,000,000 Personas cs.CL · 2024-06-28 · unverdicted · none · ref 9 · internal anchor
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models cs.CL · 2024-02-05 · unverdicted · none · ref 8 · internal anchor
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 5 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 9 · internal anchor
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 5 · internal anchor
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
Teaching Large Language Models to Self-Debug cs.CL · 2023-04-11 · unverdicted · none · ref 82 · internal anchor
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 144 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 6 · internal anchor
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 6 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) cs.LG · 2026-05-13 · unverdicted · none · ref 39 · internal anchor
RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B CodeAct agent.
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents cs.AI · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work cs.AI · 2026-05-07 · conditional · none · ref 35 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
LLMs with in-context learning for Algorithmic Theoretical Physics cs.LG · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning cs.LG · 2026-04-23 · unverdicted · none · ref 14 · internal anchor
TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp for standard fine-tuning.
SinkTrack: Attention Sink based Context Anchoring for Large Language Models cs.CV · 2026-04-11 · unverdicted · none · ref 3 · 2 links · internal anchor
SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
The Cartesian Cut in Agentic AI cs.AI · 2026-04-09 · unverdicted · none · ref 17 · internal anchor
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning cs.CV · 2025-08-15 · unverdicted · none · ref 11 · internal anchor
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
Understanding the planning of LLM agents: A survey cs.AI · 2024-02-05 · accept · none · ref 6 · internal anchor
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 82 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer