
arxiv: 2606.00579 · v1 · pith:JOW3CFTNnew · submitted 2026-05-30 · 💻 cs.CL · cs.CV

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

Pith reviewed 2026-06-28 18:58 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords coding agents · omnimodal tasks · multimodal benchmarks · sandboxed tool use · video understanding · audio processing · agent scaffolds · evidence extraction

The pith

Coding agents limited to text and images match or beat native omnimodal models on audio-video benchmarks by using code to extract evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agents restricted to text-plus-image inputs and a sandboxed code-execution interface can equal or surpass state-of-the-art native multimodal models across several video and audio benchmarks. Their advantage arises because the agents write and run code to pull targeted signals such as transcripts or key frames rather than ingesting full media streams. A sympathetic reader would care because this reframes omnimodal problems as ordinary retrieval and processing tasks that existing tool interfaces already solve. The authors further show that adding human-written or self-distilled skills raises performance and release an open training recipe called Code-X together with a new process-level benchmark called TerminalBench-O.

Core claim

Coding agents equipped only with text and image access plus a sandboxed tool-use interface match and in several settings outperform SOTA native omnimodal models and predefined multimodal agent scaffolds on multiple audio-video benchmarks; their strength lies in writing code that extracts relevant evidence from transcripts, frames, and other signals, thereby converting the tasks into retrieval and information-processing problems.
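
The "retrieval and information-processing" framing can be illustrated with a minimal sketch. It assumes a timestamped transcript is already available (for example from an ASR tool) and ranks segments by keyword overlap with the question, then proposes timestamps at which frames could be sampled. The `Segment` type, the stopword list, and the overlap scoring are illustrative assumptions, not the paper's pipeline.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # transcript text for this span


# Hypothetical stopword list, only large enough for the example.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "to", "and",
             "what", "which", "when", "did"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def retrieve(question: str, segments: list[Segment], k: int = 2) -> list[Segment]:
    """Return up to k segments sharing the most tokens with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(s.text)), s) for s in segments]
    hits = [(score, s) for score, s in scored if score > 0]
    # Highest overlap first; earlier segments win ties.
    hits.sort(key=lambda pair: (-pair[0], pair[1].start))
    return [s for _, s in hits[:k]]


def frame_timestamps(segment: Segment, n: int = 3) -> list[float]:
    """Evenly spaced timestamps strictly inside the segment."""
    step = (segment.end - segment.start) / (n + 1)
    return [round(segment.start + step * (i + 1), 2) for i in range(n)]


transcript = [
    Segment(0.0, 10.0, "Welcome to the campus tour"),
    Segment(10.0, 20.0, "The stadium opened in 1950 near the river"),
    Segment(20.0, 30.0, "Tickets for the stadium cost five dollars"),
]
best = retrieve("When did the stadium open?", transcript, k=1)[0]
print(best.text, frame_timestamps(best))
```

The point of the sketch is that only a few seconds of media need to be inspected once the transcript has narrowed the search, which is one plausible reading of why token usage can stay low.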

What carries the argument

The sandboxed tool-use interface that lets agents write and execute code to orchestrate evidence extraction from modality signals.
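
As a rough illustration of what a code-execution interface involves, the sketch below runs an agent-written Python snippet in a fresh working directory with a time limit and returns its output. The function name `run_snippet` and the returned dictionary shape are assumptions made for this example. A subprocess in a temporary directory is not a security boundary, so a real sandbox would add container or VM isolation on top.

```python
import subprocess
import sys
import tempfile


def run_snippet(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a throwaway directory and capture its output.

    This only limits wall-clock time and working directory; it does NOT
    isolate the filesystem or network.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }


result = run_snippet("print(6 * 7)")
print(result["ok"], result["stdout"].strip())
```

Returning stdout and stderr as text is what lets a text-only model read the outcome of its own code and decide on the next step.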

If this is right

  • Simple injection of human-written or self-distilled skills substantially raises agent performance on these tasks.
  • The Code-X training recipe using the OmniCoding trajectory dataset and verifiable reward produces usable baselines on 9B and 27B open-source models.
  • Process-level trace analysis and a failure taxonomy reveal concrete limits of the current approach.
  • TerminalBench-O provides a process-level benchmark for evaluating real-world many-modality processing tasks.
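
The page does not say how the verifiable reward in Code-X is computed. One common form, sketched here purely as an assumption, is a binary exact-match check on a wrapped final answer. The `<answer>` tag, the normalization rules, and the function names are all hypothetical.

```python
import re

# Hypothetical answer wrapper; the paper's actual format is not given here.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation other than dots, collapse whitespace."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s.]", "", text)
    return re.sub(r"\s+", " ", text)


def verifiable_reward(model_output: str, gold: str) -> float:
    """Return 1.0 if the last wrapped answer matches the gold answer, else 0.0."""
    matches = ANSWER_RE.findall(model_output)
    if not matches:
        return 0.0  # no wrapped answer means no reward
    return 1.0 if normalize(matches[-1]) == normalize(gold) else 0.0


print(verifiable_reward("The distance is <answer> 169 km </answer>", "169 km"))
```

A check like this is cheap to run on every trajectory, which is what makes it usable as a training signal; its weakness is that it cannot credit answers that are correct but formatted differently.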

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reduction to retrieval holds, then many existing code-execution environments could be repurposed for multimodal work without new model architectures.
  • The failure taxonomy suggests targeted improvements in code generation for complex temporal reasoning would extend the approach.
  • TerminalBench-O could serve as a template for process-level evaluation in other agent domains beyond audio and video.

Load-bearing premise

Performance gains come primarily from code-based evidence extraction rather than from benchmark-specific artifacts or the particular choice of underlying LLM and sandbox environment.

What would settle it

A controlled experiment in which the same agents lose the ability to write and run extraction code yet still match or exceed the native omnimodal models on the same benchmarks.
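
One way such a controlled comparison could be scored is a paired analysis over the same benchmark items, run once with code execution enabled and once with it disabled. The sketch below counts discordant items and applies a two-sided exact sign test. The per-item outcomes are invented for illustration and the function name is an assumption.

```python
from math import comb


def paired_comparison(with_code: list[bool], without_code: list[bool]) -> dict:
    """Compare two runs on the same items using discordant pairs."""
    if len(with_code) != len(without_code):
        raise ValueError("both runs must cover the same items")
    n = len(with_code)
    only_with = sum(a and not b for a, b in zip(with_code, without_code))
    only_without = sum(b and not a for a, b in zip(with_code, without_code))
    discordant = only_with + only_without
    if discordant == 0:
        p_value = 1.0
    else:
        # Two-sided exact sign test: under H0 each discordant item is a fair coin.
        k = min(only_with, only_without)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {
        "acc_with": sum(with_code) / n,
        "acc_without": sum(without_code) / n,
        "only_with": only_with,
        "only_without": only_without,
        "p_value": p_value,
    }


# Invented outcomes for eight items, not results from the paper.
run_with = [True, True, True, True, False, True, True, True]
run_without = [True, False, False, True, False, False, True, False]
report = paired_comparison(run_with, run_without)
print(report)
```

Pairing matters because both runs share the same base model and items, so only the availability of extraction code differs between them.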

Figures

Figures reproduced from arXiv: 2606.00579 by Dianqi Li, Dongping Chen, Qingyuan Shi, Tianyi Zhou, Xuanao Huang, Zhihan Hu.

Figure 1: We discover that coding agents are strong omnimodal processors, achieving competitive performance and even surpassing native omnimodal models with fewer tokens on video and audio content through terminal tool-use.
    Dongping Chen and Xuanao Huang contributed equally to this work. Corresponding author: Tianyi Zhou: david.tianyi.zhou@gmail.com. arXiv:2606.00579v1 [cs.CL] 30 May 2026
Figure 2: Tool-use distributions of GPT-5.4 high and Claude Opus 4.6 max across four benchmarks.
    Finding 2: Increasing reasoning effort generally improves coding-agent performance, suggesting that omnimodal task success depends not only on model perception capacity but also on the depth of agentic computation.
    Tool-use Analysis. We analyze tool-use behavior in coding agents …
Figure 3: Pareto-front of Acc-Token tradeoff. We find that sandboxed coding agents are efficient and competitive omnimodal task solvers on avg. of four benchmarks. The gray line indicates the estimated MLLM baseline.
    … reasoning-effort settings tend to use more tools, most visibly on OmniGAIA and LVOmniBench.
    Finding 3: Omnimodal problem solving proceeds through a staged tool-mediated pipeline, where media extraction,…
Figure 4: Trajectory DAG of an OmniGAIA sample (Baylor campus-tour sign × Texas sports-facility audio; ground-truth 169 km). Nodes are annotated as agent steps, coloured by the step-supervisor reward (✓ / 0 / × for +1 / 0 / −1). The Audio sub-goal includes three parallel strategies: speech_recognition, whisper, and YouTube-ID lookup; evidence from the Image and Audio sub-goals merges into Distance.
Figure 5: Distribution of primary error types across four coding agents on OmniGAIA, one pie per agent. The sample size n is shown under each agent name.
    2.2. Failure Analysis: Taxonomy and Process-level Trajectory. Given the strong performance of coding agents, we investigate the challenges they face in omni content processing. We propose a new failure mode taxonomy based on task type, and sample 200 trajectories (…
Figure 6: A representative OmniGAIA case comparing GPT-5.4 high without Skills and with Human-in-the-loop Skills. The no-Skills run answers incorrectly by directly applying the Southern B.C. special 400 m rule to Tofino, whereas the Human-in-the-loop-Skills run answers correctly using the 200 m rule for the waters immediately offshore of Tofino.
Figure 7: Performance of GPT-5.4 high on OmniGAIA under four settings.
    Both self-improving methods outperform the baseline, with Log-driven self-distillation achieving the strongest results: 76.7% average accuracy versus 73.0% for Calibration-set self-iteration, and the gap widens on harder problems (65.4% vs. 59.0% on the High split). Execution traces appear to provide richer supervision than binary feedback from …
Figure 8: Duration distribution.
    … draw from four complementary sources, OmniGAIA-SFT-2K (Li et al., 2026c), OmniVideoBench (Li et al., 2025), AVUTBenchmark (Yang et al., 2025b), and Video-MME-v2 (Fu et al., 2026), keeping only the Video-MME-v2 subset requiring audio-visual or temporal reasoning
Figure 9: Overview of our benchmark, illustrated with Task T01. The coding agent is required to generate a highlight clip and caption from a soccer video and a player query, and its outputs are evaluated by an LLM-based judge in terms of event accuracy, video quality, and task correctness.
    … benchmarks such as Terminal-Bench (Merrill et al., 2026b) and Claw-Eval (Ye et al., 2026) mainly target text or text+image tasks…
Figure 10: Overview of our process-level benchmark. (a) Dataset Overview: domain distribution across the union of OmniGAIA, LVOmniBench and SocialOmniBench (200 tasks); a word cloud of the question text; the distribution of annotated logical steps per task overlaid with the agent’s actual turn count. (b) Capability Analysis: per-task counts of image / audio / video inputs; required-versus-actual tool-category covera…
Figure 11: Annotation interface of our process-level dataset.
Figure 12: First part of prompt for calibration-set self-iteration.
Figure 13: Second part of prompt for calibration-set self-iteration.
Figure 14: Prompt for testing SocialOmni Level1.
Figure 15: Prompt for testing SocialOmni Level2.
Figure 16: Prompt for testing LVOmniBench.
Figure 17: Prompt for testing VideoZeroBench, where all questions for the same video are answered together in a single response.
Figure 18: Shared workspace, leakage-prevention, network-use, and final-answer rules.
Figure 19: Runtime reminders used when the model stops without a tool call or calls task_complete before producing a wrapped answer.
    Skill Setting                     Low    Medium  High   Avg.
    No Skills                         70.4   60.0    50.0   61.4
    Human-in-the-loop Skills          80.3   68.8    59.0   70.5
    Log-driven Self-distillation      86.0   75.0    65.4   76.7
    Calibration-set Self-iteration    83.6   71.9    59.0   73.0
Figure 20: Benchmark-specific prompt bodies for SocialOmni Level 1 and Level 2.
Figure 21: Benchmark-specific prompt body for LVOmniBench.
Figure 22: Benchmark-specific grouped-video prompt body for VideoZeroBench.
Figure 23: Prompt used for calibration-set self-iteration.
Figure 24: Representative failure case: audio perception and extraction.
Figure 25: Representative failure case: video perception and extraction.
Figure 26: Representative failure case: insufficient exploration of modal content.
Figure 27: Representative failure case: knowledge retrieval and factual error.
Figure 28: Representative failure case: logical reasoning and calculation.
Figure 29: Representative failure case: tool and environment infrastructure failure.
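
The accuracy-versus-token trade-off described in the Figure 3 caption can be made concrete with a small sketch that extracts a Pareto front from (tokens, accuracy) points. A system is kept only if no other system is at least as accurate with no more tokens. The system names and numbers below are placeholders, not values from the paper.

```python
def pareto_front(points: list[tuple[str, int, float]]) -> list[tuple[str, int, float]]:
    """Return the non-dominated (name, tokens, accuracy) points.

    Lower token count and higher accuracy are both better.
    """
    front = []
    best_acc = float("-inf")
    # Scan from cheapest to most expensive; within equal cost, best accuracy first.
    for name, tok, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly better than everything cheaper or equal
            front.append((name, tok, acc))
            best_acc = acc
    return front


# Placeholder systems with invented token budgets and accuracies.
systems = [
    ("A", 1000, 0.50),
    ("B", 2000, 0.45),
    ("C", 3000, 0.70),
    ("D", 3000, 0.60),
    ("E", 5000, 0.70),
]
front = pareto_front(systems)
print([name for name, _, _ in front])
```

In this toy data, E is dropped because C reaches the same accuracy with fewer tokens, which is the sense in which a point on the front is called efficient.
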
Original abstract

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that sandboxed coding agents limited to text+image inputs and tool-use interfaces can match or outperform SOTA native omnimodal models and multimodal agent scaffolds on audio-video benchmarks. It attributes this to code-based extraction of evidence from transcripts/frames rather than native multimodal fusion, supports the claim via trajectory analysis and failure taxonomy, shows gains from skill injection, introduces the Code-X training recipe with OmniCoding trajectories and verifiable rewards (with baselines on Qwen-3.5-9B and Qwen-3.6-27B), and proposes TerminalBench-O as a process-level benchmark for many-modality tasks.

Significance. If the central empirical claim holds after isolating the agent interface from base-model scale and benchmark artifacts, the result would be significant: it would demonstrate that many omnimodal tasks can be reduced to retrieval/processing problems solvable by code orchestration, reducing reliance on native multimodal architectures and opening a path for open-source elicitation via verifiable rewards. The introduction of TerminalBench-O and the Code-X recipe are concrete contributions that could be reused.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that gains arise 'primarily from writing code and orchestrating tools' rather than from the choice of underlying LLM or sandbox affordances is not isolated. No ablation is described that holds the base model fixed while varying only the agent interface (coding agent vs. native omnimodal), nor are results reported on tasks where tool access cannot substitute for native fusion; without these controls the attribution remains correlational.
  2. [§4.2] §4.2 (Trajectory Analysis): the process-level traces are used to argue for the code-extraction mechanism, but the section does not report quantitative metrics (e.g., fraction of successful trajectories that rely on code vs. direct LLM reasoning) or a controlled comparison against a non-coding agent using the same LLM and sandbox; this leaves the mechanistic explanation under-supported relative to the strength of the headline claim.
  3. [§5] §5 (Code-X and open-source elicitation): the reported baselines on Qwen-3.5-9B and Qwen-3.6-27B use the new OmniCoding dataset and verifiable reward, yet no comparison is provided against the same models fine-tuned with standard SFT or RL on the original benchmark data; this makes it difficult to attribute any improvement specifically to the coding-agent formulation versus dataset or reward design.
minor comments (2)
  1. [Abstract] The abstract introduces 'Code-X' and 'TerminalBench-O' without a one-sentence definition or pointer to the section where they are formally defined; add these on first use.
  2. [Figures in §4] Figure captions for the failure taxonomy and trajectory visualizations should explicitly state the number of trajectories or examples analyzed and the inter-annotator agreement if human labeling was involved.
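
The quantitative metric requested in major comment 2 could be computed from logged trajectories along the following lines. This is a sketch under stated assumptions: the `Trajectory` record, the set of tool names counted as code-based extraction, and the sample data are all hypothetical, and the paper's trace format may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trajectory:
    correct: bool                                         # final answer judged correct
    tool_calls: list[str] = field(default_factory=list)  # tool names used, in order


# Hypothetical set of tools counted as code-based evidence extraction.
EXTRACTION_TOOLS = {"ffmpeg", "whisper", "ocr"}


def uses_extraction(t: Trajectory) -> bool:
    return any(call in EXTRACTION_TOOLS for call in t.tool_calls)


def code_reliance(trajs: list[Trajectory]) -> dict[str, Optional[float]]:
    """Fraction of successful and of failed trajectories that used extraction tools."""
    def fraction(group: list[Trajectory]) -> Optional[float]:
        if not group:
            return None  # undefined when the group is empty
        return sum(uses_extraction(t) for t in group) / len(group)

    return {
        "success": fraction([t for t in trajs if t.correct]),
        "failure": fraction([t for t in trajs if not t.correct]),
    }


# Invented sample, not data from the paper.
sample = [
    Trajectory(True, ["ffmpeg", "whisper"]),
    Trajectory(True, []),
    Trajectory(False, ["ffmpeg"]),
    Trajectory(True, ["ocr"]),
]
stats = code_reliance(sample)
print(stats)
```

Reporting the failure-side fraction next to the success-side one matters: if both are equally high, extraction-tool use alone does not distinguish successful runs.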

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that help clarify the strength of our empirical claims. We address each major point below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that gains arise 'primarily from writing code and orchestrating tools' rather than from the choice of underlying LLM or sandbox affordances is not isolated. No ablation is described that holds the base model fixed while varying only the agent interface (coding agent vs. native omnimodal), nor are results reported on tasks where tool access cannot substitute for native fusion; without these controls the attribution remains correlational.

    Authors: We agree that a controlled ablation holding the base model fixed would strengthen attribution to the coding-agent interface. Our current results compare against published SOTA native omnimodal models that use different base models and training. The manuscript does not contain such an ablation. We will revise the abstract and §4 to explicitly acknowledge this limitation and clarify that the claim is supported by trajectory evidence of code usage rather than a fully isolated causal demonstration. A full ablation is not feasible without new experiments matching exact base models across interfaces. revision: partial

  2. Referee: [§4.2] §4.2 (Trajectory Analysis): the process-level traces are used to argue for the code-extraction mechanism, but the section does not report quantitative metrics (e.g., fraction of successful trajectories that rely on code vs. direct LLM reasoning) or a controlled comparison against a non-coding agent using the same LLM and sandbox; this leaves the mechanistic explanation under-supported relative to the strength of the headline claim.

    Authors: We acknowledge that quantitative metrics would better support the mechanistic argument. §4.2 currently presents qualitative trajectories and a failure taxonomy. We will revise the section to include quantitative statistics computed from the existing trajectory data, such as the fraction of successful trajectories that rely on code-based extraction versus direct reasoning. A controlled comparison to a non-coding agent would require additional runs, but the added metrics will provide stronger quantitative grounding for the code-extraction claim. revision: yes

  3. Referee: [§5] §5 (Code-X and open-source elicitation): the reported baselines on Qwen-3.5-9B and Qwen-3.6-27B use the new OmniCoding dataset and verifiable reward, yet no comparison is provided against the same models fine-tuned with standard SFT or RL on the original benchmark data; this makes it difficult to attribute any improvement specifically to the coding-agent formulation versus dataset or reward design.

    Authors: We agree that comparisons against standard SFT and RL on the original benchmark data are needed to isolate the contribution of the coding-agent formulation. We will add these baselines for both Qwen models in the revised §5, training with standard SFT/RL on the original benchmark data under the same compute budget. This will allow direct attribution of gains to the OmniCoding trajectories and verifiable rewards. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark comparisons without derivations or self-referential reductions

full rationale

The paper presents an empirical argument that sandboxed coding agents with text+image access can match or exceed native omnimodal models on audio-video tasks by converting them into code-orchestrated retrieval problems. The abstract and provided text contain no equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations. Claims are supported by experimental results, trajectory analysis, failure taxonomy, and new benchmarks (Code-X, TerminalBench-O) rather than any derivation that reduces to its own inputs by construction. No self-definitional, fitted-input, or uniqueness-imported patterns appear. The central attribution to code-based evidence extraction is an interpretation of results, not a mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only abstract available; ledger is therefore minimal and provisional.

axioms (1)
  • [domain assumption] Omnimodal tasks can be reframed as retrieval and information-processing problems solvable by code orchestration without native multimodal ingestion
    Central premise stated in the trajectory analysis sentence of the abstract.
invented entities (2)
  • Code-X (no independent evidence)
    purpose: Training recipe using OmniCoding trajectory dataset and verifiable reward
    New training method introduced in the abstract.
  • TerminalBench-O (no independent evidence)
    purpose: Process-level benchmark for real-world omnimodal processing tasks
    New benchmark introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5775 in / 1308 out tokens · 21337 ms · 2026-06-28T18:58:13.611085+00:00 · methodology

