The NoisyCausal benchmark tests LLMs on causal reasoning under structured noise; a modular LLM-plus-causal-graph framework outperforms baselines and generalizes to Cladder.
arXiv preprint arXiv:2305.12295
6 Pith papers cite this work.
Years: 2026. Verdicts: 6 (all unverdicted). Representative citing papers: 6.
- NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise
  The NoisyCausal benchmark tests LLMs on causal reasoning under structured noise; a modular LLM-plus-causal-graph framework outperforms baselines and generalizes to Cladder.
- CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents
  The CodeClinic benchmark shows that LLM-generated Python skill libraries derived from clinical guidelines improve consistency and reduce token consumption by up to 40% versus zero-shot approaches on MIMIC-IV-based tasks.
- From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS
  A new benchmark and deterministic pipeline translate natural-language reasoning into executable Narsese for NARS, with execution-based validation and an initial LLM adaptation for three-label classification.
- LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
  LAST augments MLLMs with a tool-abstraction sandbox and three-stage training, delivering roughly 20% gains on spatial-reasoning tasks and outperforming closed-source models.
- LLM Reasoning Is Latent, Not the Chain of Thought
  LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
- VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline
  VeriTrans reaches 94.46% SAT/UNSAT correctness on SatBench by gating LLM translations on round-trip similarity and executing them in a deterministic neuro-symbolic pipeline.
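The round-trip similarity gating mentioned in the VeriTrans summary can be illustrated with a minimal sketch: translate the natural-language query to a formal form, back-translate it, and accept the translation only if the round trip nearly recovers the original. Everything below is hypothetical — the `round_trip_gate` helper, the table-lookup "translators", and the 0.8 threshold are illustrative stand-ins, not the paper's fine-tuned LLM or its actual acceptance criterion.

```python
from difflib import SequenceMatcher

def round_trip_gate(nl_query, translate, back_translate, threshold=0.8):
    # Translate NL -> formal, then formal -> NL, and accept the formal
    # translation only if the round trip recovers something close to the
    # original query. `translate`/`back_translate` stand in for LLM calls;
    # the 0.8 threshold is an illustrative choice, not from the paper.
    formal = translate(nl_query)
    recovered = back_translate(formal)
    score = SequenceMatcher(None, nl_query.lower(), recovered.lower()).ratio()
    return (formal if score >= threshold else None), score

# Toy stand-ins: table-lookup "translators" instead of LLMs.
nl_to_cnf = {"either a or b, and not a": "(a | b) & ~a"}
cnf_to_nl = {v: k for k, v in nl_to_cnf.items()}

formal, score = round_trip_gate(
    "either a or b, and not a",
    translate=nl_to_cnf.get,
    back_translate=cnf_to_nl.get,
)
```

A noisy or unfaithful translation would back-translate to a dissimilar sentence, score below the threshold, and be rejected before any downstream solver runs — which is the point of gating before deterministic execution.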