PAL: Program-aided Language Models
19 papers cite this work.
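PAL's core move is to keep the reasoning in program form: the model writes Python whose execution, rather than the model's own token-by-token arithmetic, produces the answer. A minimal sketch of the pattern, where `llm` is a hypothetical stand-in for any completion API and the canned program is purely illustrative:

```python
# Minimal PAL-style loop: the model writes a program, the interpreter answers.
# `llm` is a hypothetical completion function standing in for any LLM API.

def llm(prompt: str) -> str:
    # Stand-in: a real system would call a model here. This canned reply
    # shows the shape of a PAL generation for a simple word problem.
    return (
        "tickets = 3\n"
        "price = 12.50\n"
        "discount = 0.2\n"
        "answer = tickets * price * (1 - discount)\n"
    )

def pal_solve(question: str) -> float:
    program = llm(f"# Q: {question}\n# Write Python that sets `answer`.\n")
    scope: dict = {}
    exec(program, scope)  # execution, not the model, does the arithmetic
    return scope["answer"]

print(pal_solve("Three $12.50 tickets with a 20% discount cost how much?"))
# 30.0
```

The same generate-then-execute loop recurs in several of the citing papers below (ThinC, PoT, ReTool).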
Citing papers
-
Teaching Language Models to Think in Code
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
-
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Hard and MMLU-Hard.
-
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.
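The kind of rewrite this paper targets is familiar from floating-point practice: algebraically equal expressions can differ wildly in accuracy near cancellation points. A small illustration (my example, not one from the paper's benchmark):

```python
import math

x = 1e-8

# Naive form: 1 - cos(x) suffers catastrophic cancellation for small x,
# because cos(x) rounds to exactly 1.0 in double precision.
naive = (1 - math.cos(x)) / x**2

# Stable rewrite: 1 - cos(x) == 2*sin(x/2)**2 avoids the subtraction.
stable = 2 * math.sin(x / 2) ** 2 / x**2

print(naive)   # 0.0 -- all significant digits lost
print(stable)  # ~0.5 -- matches the true limit of 1/2 as x -> 0
```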
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
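The pipeline is translate, plan, translate back: the LLM renders the problem as a PDDL problem file, a classical planner solves it optimally, and the LLM verbalizes the plan. A sketch of the middle artifact for a two-block stacking goal (domain and predicate names follow blocksworld convention; the translation itself would be done by the model):

```python
# Sketch of the PDDL problem an LLM+P-style translator might emit for
# "put block a on block b". Predicate names are illustrative.

def to_pddl(init_facts: list[str], goal: str) -> str:
    facts = "\n    ".join(f"({f})" for f in init_facts)
    return (
        "(define (problem stack-two)\n"
        "  (:domain blocksworld)\n"
        "  (:objects a b)\n"
        f"  (:init\n    {facts})\n"
        f"  (:goal ({goal})))"
    )

problem = to_pddl(
    ["ontable a", "ontable b", "clear a", "clear b", "handempty"],
    "on a b",
)
# This string would be handed to a classical planner such as Fast Downward.
print(problem)
```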
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
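The contrast with chain of thought is easiest to see on loop-heavy arithmetic. A PoT-style generation for a Fibonacci query might look like the following (an illustrative example; the executed program, not the model, performs the fifty additions a prose chain of thought would have to carry out by hand):

```python
# PoT-style program for "What is the 50th Fibonacci number?"
a, b = 0, 1
for _ in range(50):
    a, b = b, a + b
ans = a
print(ans)  # 12586269025
```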
-
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
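The attack surface here is concurrency: requests served in the same batch should never observe each other's state. A toy harness in that spirit, with `complete` as a hypothetical stand-in for a serving endpoint; the real fuzzer is certainly more elaborate:

```python
import asyncio
import uuid

async def complete(prompt: str) -> str:
    # Stand-in for a real serving endpoint (e.g., an HTTP call to vLLM).
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def probe(canaries: dict[str, str], session: str) -> None:
    reply = await complete(f"Repeat this secret: {canaries[session]}")
    # Isolation check: a reply must never contain another session's canary.
    for other, secret in canaries.items():
        if other != session and secret in reply:
            print(f"LEAK: session {session} observed {other}'s canary")

async def fuzz_round(n: int = 8) -> None:
    canaries = {f"s{i}": uuid.uuid4().hex for i in range(n)}
    # Fire all requests concurrently to stress KV-cache isolation.
    await asyncio.gather(*(probe(canaries, s) for s in canaries))

asyncio.run(fuzz_round())
```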
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
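As the summary describes it, the reward credits a trace both for rubric quality and for the measured gain it produces on a frozen executor. A schematic reward in that shape (the weighting and names are my own, not the paper's):

```python
def trace_reward(
    rubric_score: float,       # graded quality of the trace, in [0, 1]
    acc_with_trace: float,     # frozen executor's accuracy given the trace
    acc_without_trace: float,  # frozen executor's baseline accuracy
    alpha: float = 0.5,
) -> float:
    # Credit traces both for looking good (rubric) and for actually
    # helping the executor (grounded gain); alpha balances the two.
    gain = acc_with_trace - acc_without_trace
    return alpha * rubric_score + (1 - alpha) * gain

print(trace_reward(0.9, 0.7, 0.5))  # 0.55
```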
-
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
QMP-Bench supplies a realistic test set for AI on quantum many-body problems, while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.
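The two gates described, syntax and physics, are easy to picture as a pipeline: a candidate must first compile and run, then satisfy a domain invariant. A toy version using Hermiticity of a generated Hamiltonian as the (illustrative) physics check:

```python
import numpy as np

def syntax_check(src: str) -> bool:
    try:
        compile(src, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def physics_check(hamiltonian: np.ndarray, atol: float = 1e-10) -> bool:
    # Example invariant: a Hamiltonian must be Hermitian (H == H†).
    return np.allclose(hamiltonian, hamiltonian.conj().T, atol=atol)

src = "import numpy as np\nH = np.array([[1.0, 2j], [-2j, 3.0]])\n"
assert syntax_check(src)
scope: dict = {}
exec(src, scope)
print(physics_check(scope["H"]))  # True: this candidate passes both gates
```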
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train a 32B LLM to invoke code tools dynamically during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.
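The mechanism behind such gains is allocating samples per problem rather than uniformly: stop early on easy problems and spend the budget on hard ones. A sketch with stubbed generator and verifier (the stopping rule here is illustrative, not the paper's policy):

```python
import random

random.seed(0)

def generate() -> str:
    return random.choice(["right", "wrong", "wrong"])  # stubbed sampler

def verifier_score(answer: str) -> float:
    return 0.9 if answer == "right" else 0.2  # stand-in for a learned verifier

def adaptive(budget: int, threshold: float = 0.8) -> tuple[str, int]:
    best, best_score = "", -1.0
    for used in range(1, budget + 1):
        ans = generate()
        score = verifier_score(ans)
        if score > best_score:
            best, best_score = ans, score
        if best_score >= threshold:  # stop early on an easy problem
            return best, used
    return best, budget  # hard problem: the full budget was spent

answer, used = adaptive(budget=16)
print(answer, f"used {used}/16 samples")
```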
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
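The loop is concrete enough to sketch: generate code, run the tests, and on failure feed the model its own line-by-line explanation plus the traceback. A minimal version, with `llm` as a hypothetical model call:

```python
import traceback

def llm(prompt: str) -> str:
    # Stand-in for a model call; a real loop would send `prompt` to an LLM.
    return "def add(a, b):\n    return a + b\n"

def self_debug(task: str, test: str, max_rounds: int = 3) -> str:
    code = llm(f"Write Python for: {task}")
    for _ in range(max_rounds):
        try:
            scope: dict = {}
            exec(code + "\n" + test, scope)
            return code  # tests pass
        except Exception:
            feedback = traceback.format_exc()
            # Rubber-duck step: the model explains its own code, then fixes
            # it using the execution feedback.
            explanation = llm(f"Explain line by line:\n{code}")
            code = llm(
                f"Task: {task}\nCode:\n{code}\nExplanation:\n{explanation}\n"
                f"Test failure:\n{feedback}\nReturn corrected code."
            )
    return code

print(self_debug("add two numbers", "assert add(2, 3) == 5"))
```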
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
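The CAS side of that workflow is mechanical: the model proposes a symbolic result and a computer algebra system certifies it. For instance, with SymPy (my example, not one of the paper's tasks):

```python
import sympy as sp

x = sp.symbols("x")

# Suppose the model claims: integral of x*exp(-x**2) dx = -exp(-x**2)/2.
claimed = -sp.exp(-x**2) / 2
integrand = x * sp.exp(-x**2)

# The CAS verifies the claim by differentiating the claimed antiderivative.
print(sp.simplify(sp.diff(claimed, x) - integrand) == 0)  # True
```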
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
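The pass@1 figure is the standard functional-correctness metric; with n samples per task it is conventionally computed via the unbiased estimator from the Codex evaluation paper rather than by literal resampling. A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples per task, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 4 correct, the chance a size-3 draw contains a pass:
print(pass_at_k(n=10, c=4, k=3))  # 0.8333...
```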
-
Self-Refine: Iterative Refinement with Self-Feedback
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
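The protocol uses one model in three roles, generator, critic, and reviser, iterating until the critique signals no further issues. A minimal sketch with a stubbed `llm`:

```python
def llm(prompt: str) -> str:
    # Stand-in for a single model used in all three roles below.
    return "draft"

def self_refine(task: str, max_iters: int = 4) -> str:
    output = llm(f"Respond to: {task}")                       # generate
    for _ in range(max_iters):
        feedback = llm(f"Critique this response:\n{output}")  # critique
        if "no issues" in feedback.lower():                   # stop signal
            break
        output = llm(                                         # refine
            f"Task: {task}\nDraft:\n{output}\nFeedback:\n{feedback}\nRevise."
        )
    return output

print(self_refine("summarize the abstract"))
```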
-
SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.