pith. machine review for the scientific record.

arxiv: 2604.13536 · v2 · submitted 2026-04-15 · 💻 cs.OS

Recognition: unknown

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy

Andrea C. Arpaci-Dusseau, Jing Liu, Junxuan Liao, Mai Zheng, Remzi H. Arpaci-Dusseau, Shawn Wanxiang Zhong

Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3

classification 💻 cs.OS
keywords AI agents · filesystem safety · agent autonomy · staging · snapshots · progressive permissions · file operations

The pith

An agent-native filesystem stages changes and gives agents snapshots so they can self-correct mistakes while cutting user prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AI coding agents regularly harm filesystems because they lack clear information about their effects and lack mechanisms to undo or review them. Current solutions force a choice between full access that risks damage and constant permission requests that slow agents down. YoloFS addresses this by isolating every mutation until commit, exposing snapshots to the agent for its own error detection, and gating access through progressive permissions that ask users only when necessary. Evaluation on tasks with hidden side effects and on routine work demonstrates that agents correct themselves more often and users interact less often without losing success rates. A reader would care because the approach moves safety primitives into the infrastructure that agents already use, rather than layering them on top.

Core claim

YoloFS is an agent-native filesystem whose staging isolates all mutations before commit, whose snapshots let agents detect and undo their own side effects, and whose progressive permission grants access with minimal user prompts. On 11 tasks containing hidden side effects, agents using YoloFS performed self-correction in 8 cases while every change remained staged and reviewable. On 112 routine tasks, the same system matched baseline success rates yet required fewer user interactions.
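
To make the staging idea concrete, here is a minimal in-memory sketch, not the paper's implementation. It holds writes and deletions apart from a base mapping until commit, so every pending effect can be listed and reviewed first. The class `StagedView`, its method names, and the dict-based base are hypothetical; the tombstone marker loosely follows the override-tree legend quoted in the Figure 6 caption.

```python
from typing import Dict, Optional

# Hypothetical sentinel marking a staged deletion (a "tombstone").
TOMBSTONE = object()


class StagedView:
    """Toy staging layer over a base mapping of path -> contents."""

    def __init__(self, base: Dict[str, str]) -> None:
        self.base = base                      # committed state, untouched until commit()
        self.staged: Dict[str, object] = {}   # pending writes and tombstones

    def write(self, path: str, data: str) -> None:
        self.staged[path] = data

    def delete(self, path: str) -> None:
        self.staged[path] = TOMBSTONE

    def read(self, path: str) -> Optional[str]:
        # Staged entries shadow the base; a tombstone hides the base file.
        if path in self.staged:
            value = self.staged[path]
            return None if value is TOMBSTONE else str(value)
        return self.base.get(path)

    def diff(self) -> Dict[str, str]:
        # Reviewable summary of pending effects relative to the base.
        changes: Dict[str, str] = {}
        for path, value in self.staged.items():
            if value is TOMBSTONE:
                if path in self.base:
                    changes[path] = "delete"
            elif path not in self.base:
                changes[path] = "create"
            elif self.base[path] != value:
                changes[path] = "modify"
        return changes

    def commit(self) -> None:
        # Apply staged effects to the base, then clear the staging area.
        for path, value in self.staged.items():
            if value is TOMBSTONE:
                self.base.pop(path, None)
            else:
                self.base[path] = str(value)
        self.staged.clear()

    def discard(self) -> None:
        # Drop all staged effects; the base was never modified.
        self.staged.clear()


base = {"d1/x": "old", "d1/y": "keep"}
view = StagedView(base)
view.write("d1/x", "new")
view.delete("d1/y")
view.write("d1/z", "fresh")
print(view.diff())   # pending effects, base still unchanged
view.commit()
print(base)
```

In this sketch the base stays byte-for-byte unchanged until `commit()`, which is the property the review summary attributes to staging; a real filesystem would also have to handle directories, metadata, and concurrent readers.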

What carries the argument

YoloFS, an agent-native filesystem that implements staging to hold mutations uncommitted, snapshots visible to agents for self-review, and progressive permission that escalates access checks only as needed.
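
One way to picture progressive permission is a gate that prompts only when a request falls outside anything already granted. The sketch below is an assumption-laden illustration, not YoloFS's policy: the class `ProgressiveGate`, the directory-scoped grant rule, and the `ask` callback standing in for a user prompt are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Callable, Set, Tuple


class ProgressiveGate:
    """Toy permission gate: ask once per (mode, directory), then reuse the grant."""

    def __init__(self, ask: Callable[[str, str], bool]) -> None:
        self.ask = ask                              # stand-in for a user prompt
        self.granted: Set[Tuple[str, str]] = set()  # (mode, directory) grants
        self.prompts = 0                            # how often the user was asked

    def _covered(self, mode: str, path: str) -> bool:
        # A grant on a directory covers that directory and everything beneath it.
        target = PurePosixPath(path)
        for granted_mode, directory in self.granted:
            root = PurePosixPath(directory)
            if granted_mode == mode and (target == root or root in target.parents):
                return True
        return False

    def check(self, mode: str, path: str) -> bool:
        if self._covered(mode, path):
            return True                             # no prompt needed
        self.prompts += 1
        directory = str(PurePosixPath(path).parent)
        if self.ask(mode, directory):
            self.granted.add((mode, directory))
            return True
        return False


# Hypothetical user policy: approve anything under /proj, refuse the rest.
gate = ProgressiveGate(lambda mode, directory: directory.startswith("/proj"))
print(gate.check("write", "/proj/src/a.py"), gate.prompts)
print(gate.check("write", "/proj/src/b.py"), gate.prompts)
print(gate.check("write", "/home/u/.ssh/id_ed25519"), gate.prompts)
```

The design choice illustrated here is that the prompt count grows with the number of distinct scopes touched rather than with the number of operations, which is the kind of reduction in user interaction the summary describes.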

If this is right

  • Agents gain the ability to detect and reverse their own filesystem errors before they become permanent.
  • Users see every change held in a reviewable state rather than applied immediately.
  • Routine agent tasks finish at the same success rate with measurably fewer permission requests.
  • Hidden side effects become visible to both agent and user through the staged and snapshot layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staging-plus-snapshot pattern could be applied to other shared resources such as network sockets or process state to give agents corrective control beyond files.
  • Agent frameworks could adopt filesystem-level snapshots as a standard primitive, reducing reliance on application-level logging or undo stacks.
  • Widespread use would shift safety engineering from prompt engineering and sandbox wrappers toward infrastructure guarantees that persist across agent versions.
  • If the mechanisms prove lightweight, they could serve as a model for giving other autonomous software systems self-auditing capabilities without human oversight.

Load-bearing premise

Staging, snapshots, and progressive permissions can be added to a real filesystem without breaking compatibility with existing programs or imposing performance costs that agents and users will reject.

What would settle it

A deployment in which agents using the staged filesystem complete fewer tasks than the baseline, or in which the snapshot and permission overhead leaves users answering as many prompts as before, or more.

Figures

Figures reproduced from arXiv: 2604.13536 by Andrea C. Arpaci-Dusseau, Jing Liu, Junxuan Liao, Mai Zheng, Remzi H. Arpaci-Dusseau, Shawn Wanxiang Zhong.

Figure 1
Figure 1: Shifting information (I) and control (C) from agents to filesystems improves safety while reducing user interaction. The dilemma: safety vs. autonomy. Unfortunately, agents regularly cause damage. They have wiped entire drives [R4], destroyed irreplaceable personal documents [R22], and silently leaked credentials to attackers [R231]. To prevent such damage, most frameworks prompt the user before taking ac…
Figure 2
Figure 2: The current agent–filesystem interface. The framework coordinates the user, the model, and the filesystem. • We propose the principles of agent-native filesystems and build YoloFS with three novel techniques: staging for visibility and user corrective control, snapshots and travel for agent self-correction, and progressive permission for preventive control with low user interaction. • We introduce an agen…
Figure 3
Figure 3: summarizes the impact across five dimensions. Panel labels recovered from the plot: Op (207): write 44%, delete 39%, read; Scope (205): project 58%, system, home, secret; Agent (152): unaware 68%, regretful 21%, lied; User (201): noticed 83%, delay, no; Undo (180): failed 40%, trivial 31%, recoverable 29%.
Figure 4
Figure 4: Taxonomy of causes (290 total). Counts can overlap. in git [R40]). Of the remainder, 31% are trivially recoverable (e.g. git checkout) and 29% rely on user effort (e.g. recoverable only from backups [R129]). Some actions are inherently irreversible: a leaked credential cannot be unread [R239]. Finding 2 (insufficient control): Users and agents have insufficient control over the filesystem effects of tool…
Figure 5
Figure 5: Codex prompts the user with the generated shell script. or similar settings that remove them from the decision entirely. The flag names themselves acknowledge the danger, yet users enable them [R244] because the alternative is answering hundreds of prompts per session. Related to decision fatigue [73] in psychology and approval fatigue in security [23, 89], this problem is also called such in an agenti…
Figure 6
Figure 6: Override tree and journal after each operation on a base directory d1/ containing files x, y, z. The first column shows the base state; each subsequent column shows the override tree (top) and journal record (bottom) after that step. In the override tree, ino x = StagedFile(x), path = BasePath(path), and ∅ = Tombstone. 4.3 Snapshots for Agent Self-Correction Staging makes filesystem effects visible and rev…
Figure 7
Figure 7: Snapshot (P) and travel (T) markers partition the journal into segments. 𝑛 are generation numbers. indicate travel targets. At 4, seg 0, 1, and 3 are live and seg 2 is dead. actions via ioctls, and the kernel appends the corresponding marker records alongside the action records from §4.2. As shown in
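
The Figure 7 caption describes snapshot and travel markers partitioning a journal into live and dead segments. The sketch below is one possible reading of that description, under assumptions the caption does not confirm: each snapshot marker closes the current segment, a travel marker returns to the live set recorded at a chosen snapshot and abandons the segment in progress, and generations are counted from 1. The function `live_segments` and the record encoding are hypothetical.

```python
from typing import Dict, List, Tuple

# A journal record is (kind, argument):
#   ("op", description)   an ordinary action in the current segment
#   ("P", None)           snapshot marker
#   ("T", generation)     travel marker targeting an earlier snapshot
Record = Tuple[str, object]


def live_segments(journal: List[Record]) -> Tuple[List[int], List[int]]:
    """Return (live, dead) segment ids after replaying the journal markers."""
    live: List[int] = []                  # closed segments reachable from the current state
    current = 0                           # id of the segment being appended to
    snapshots: Dict[int, List[int]] = {}  # generation -> live set at that snapshot
    generation = 0

    for kind, arg in journal:
        if kind == "P":
            # Snapshot: the current segment becomes part of the recorded state.
            live = live + [current]
            generation += 1
            snapshots[generation] = list(live)
            current += 1
        elif kind == "T":
            # Travel: restore the live set of the target snapshot and
            # abandon the segment that was in progress.
            live = list(snapshots[int(arg)])  # KeyError if the target does not exist
            current += 1
        # "op" records stay inside the current segment.

    live_now = live + [current]
    dead = [seg for seg in range(current + 1) if seg not in live_now]
    return live_now, dead


journal: List[Record] = [
    ("op", "write x"),
    ("P", None),        # snapshot generation 1 closes seg 0
    ("op", "write y"),
    ("P", None),        # snapshot generation 2 closes seg 1
    ("op", "delete z"), # seg 2, later abandoned
    ("T", 2),           # travel back to generation 2, opening seg 3
    ("op", "write z"),
]
print(live_segments(journal))
```

With this example journal the function reports segments 0, 1, and 3 as live and segment 2 as dead, which matches the state the caption describes; whether the paper's journal uses the same bookkeeping is not established by the excerpt.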
Figure 8
Figure 8: Metadata operation latency. The files can reside in the base filesystem, a snapshot, or the staging area. [Plot labels recovered from extraction: panels create, read, commit; axes Latency (µs/op), ms, Number of snapshots; series YoloFS, OverlayFS, BranchFS.]
Figure 9
Figure 9: Snapshot scalability. As the number of snapshots grows, do filesystems become slower? (OverlayFS fails at ~50 snapshots.) [Plot labels recovered from extraction: panels Worktree, Init. Build, Read, Edit, Incr. Build, Git, Total; axis Time (s); series Base, YoloFS, OverlayFS; phases run, snapshot, commit.]
Figure 10
Figure 10: A developer workload of setting up and iterating on the Linux kernel codebase. Setup. We run experiments on a machine with an AMD EPYC 7302P 16-Core Processor at 3 GHz and 125 GB of DDR4-3200 memory. We use Ubuntu 24.04 and the default Linux 6.8 kernel. For the base filesystem, all experiments use an Ext4 filesystem formatted and mounted with default options on a SATA 3.2 SSD with 480 GB capacity. Single…
Original abstract

AI coding agents operate directly on users' filesystems, where they regularly corrupt data, delete files, and leak secrets. Current approaches force a tradeoff between safety and autonomy: unrestricted access risks harm, while frequent permission prompts burden users and block agents. To understand this problem, we conduct the first systematic study of agent filesystem misuse, analyzing 290 public reports across 13 frameworks. Our analysis reveals that today's agents have limited information about their filesystem effects and insufficient control over them. We therefore argue for shifting this information and control to the filesystem itself. Based on this principle, we design YoloFS, an agent-native filesystem with three techniques. Staging isolates all mutations before commit, giving users corrective control. Snapshots extend this control to agents, letting them detect and correct their own mistakes. Progressive permission provides users with preventive control by gating access with minimal interaction. To evaluate YoloFS, we introduce a new methodology that captures user-agent-filesystem interactions. On 11 tasks with hidden side effects, YoloFS enables agent self-correction in 8 while keeping all effects staged and reviewable. On 112 routine tasks, YoloFS requires fewer user interactions while matching the baseline success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts the first systematic study of AI agent filesystem misuse, analyzing 290 public reports across 13 frameworks to show that agents have limited information about and control over filesystem effects. It proposes YoloFS, an agent-native filesystem with three techniques—staging to isolate mutations before commit, snapshots to enable agent self-correction, and progressive permissions to gate access with minimal user interaction—and evaluates it via a new methodology capturing user-agent-filesystem interactions. On 11 tasks with hidden side effects, YoloFS enables self-correction in 8 cases while keeping effects staged and reviewable; on 112 routine tasks, it requires fewer user interactions while matching baseline success rates.

Significance. If the mechanisms hold in practice, the work could meaningfully advance agent safety by moving information and control into the filesystem layer rather than relying on prompts or restrictions. The systematic misuse study provides a useful empirical foundation. The new evaluation methodology is a positive step toward reproducible agent-FS interaction testing. However, without demonstrated implementation feasibility, the practical significance remains provisional.

major comments (2)
  1. [Evaluation] Evaluation section: The headline results (self-correction in 8/11 hidden-side-effect tasks; fewer interactions on 112 routine tasks) are load-bearing for the central claims yet the abstract and evaluation provide no details on task selection criteria, baseline implementations, statistical significance testing, or potential confounds, preventing verification of the reported gains.
  2. [YoloFS Design] YoloFS design (staging, snapshots, progressive permissions): These three mechanisms are essential to the safety/autonomy argument, but the manuscript contains no performance overhead measurements, POSIX compatibility analysis, or implementation approach (kernel vs. user-space), directly bearing on the skeptic concern that the benefits may not translate beyond an idealized prototype.
minor comments (2)
  1. [Abstract] Abstract: YoloFS is introduced without expanding the acronym or briefly situating it relative to existing filesystems (e.g., FUSE-based or kernel modifications).
  2. [Evaluation] The new evaluation methodology is described at a high level; a dedicated subsection or appendix with pseudocode or interaction traces would improve reproducibility.
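
To illustrate the kind of uncertainty reporting the first major comment asks for, the sketch below computes a 95% Wilson score interval for the reported 8 of 11 self-correction outcomes. The choice of the Wilson interval is an assumption made here for illustration; neither the paper nor the report names a specific method, and the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(8, 11)
print(f"8/11 = {8 / 11:.2f}, 95% Wilson interval: {low:.2f} to {high:.2f}")
```

For 8 of 11 the interval spans roughly 0.43 to 0.90, which shows how wide the uncertainty is at this sample size and why the task selection criteria matter for interpreting the headline number.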

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the value of the systematic misuse study and the new evaluation methodology. We address each major comment below and commit to targeted revisions that improve verifiability and demonstrate implementation feasibility.

  1. Referee: [Evaluation] Evaluation section: The headline results (self-correction in 8/11 hidden-side-effect tasks; fewer interactions on 112 routine tasks) are load-bearing for the central claims yet the abstract and evaluation provide no details on task selection criteria, baseline implementations, statistical significance testing, or potential confounds, preventing verification of the reported gains.

    Authors: We agree that the current evaluation section lacks the necessary detail for independent verification. In the revised manuscript we will expand the Evaluation section to include: explicit criteria used to select the 11 hidden-side-effect tasks and the 112 routine tasks (derived from patterns in the 290 misuse reports); full descriptions of the baseline agent implementations and filesystem configurations; results of statistical significance testing on success rates and interaction counts; and an explicit discussion of potential confounds such as agent stochasticity and task environment variability. These additions will allow readers to assess the reported gains in self-correction and reduced user interactions. revision: yes

  2. Referee: [YoloFS Design] YoloFS design (staging, snapshots, progressive permissions): These three mechanisms are essential to the safety/autonomy argument, but the manuscript contains no performance overhead measurements, POSIX compatibility analysis, or implementation approach (kernel vs. user-space), directly bearing on the skeptic concern that the benefits may not translate beyond an idealized prototype.

    Authors: We acknowledge that the manuscript currently omits quantitative overhead data, POSIX compatibility details, and an explicit implementation strategy, which leaves open questions about practical realization. We will add a new subsection titled 'Prototype Implementation and Overhead Analysis' that describes a user-space FUSE-based prototype supporting the three mechanisms, reports preliminary benchmark results for common operations (file creation, mutation staging, snapshot creation, and permission checks), and provides a compatibility analysis showing that standard POSIX semantics are preserved for agent-relevant calls while the new agent-native features are layered on top. This material will directly address feasibility concerns and show that the design is not limited to an idealized prototype. revision: yes
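
The overhead measurements discussed in the second exchange could start from a per-operation microbenchmark. The sketch below times file creation in a temporary directory on whatever filesystem backs it; it is a generic harness under stated assumptions, not the authors' benchmark, and the function name `create_latency_us` and the default file count are hypothetical.

```python
import os
import tempfile
import time


def create_latency_us(n_files: int = 200) -> float:
    """Average latency in microseconds to create one empty file."""
    if n_files <= 0:
        raise ValueError("n_files must be positive")
    with tempfile.TemporaryDirectory() as root:
        start = time.perf_counter()
        for i in range(n_files):
            # O_EXCL makes each call a genuine create rather than an open of an existing file.
            fd = os.open(
                os.path.join(root, f"f{i}"),
                os.O_CREAT | os.O_WRONLY | os.O_EXCL,
                0o644,
            )
            os.close(fd)
        elapsed = time.perf_counter() - start
    return elapsed / n_files * 1e6


print(f"create: {create_latency_us():.1f} us/op")
```

Running the same function with the temporary directory placed on a baseline mount and on a staged mount would give a first comparison; a careful study would also control for caching, repeat runs, and report variance.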

Circularity Check

0 steps flagged

No circularity: empirical study and new evaluation methodology are independent of results


The paper's chain proceeds from an external analysis of 290 public reports across 13 frameworks, to a design argument for shifting control to the FS, to an introduced evaluation methodology applied to 11 hidden-side-effect tasks and 112 routine tasks. No equations, fitted parameters, or self-referential definitions appear. Claims about self-correction rates and interaction counts are presented as direct measurements on described tasks rather than reductions to prior fits or self-citations. The central results remain falsifiable via the stated task set and do not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems design and empirical evaluation paper; it introduces no mathematical free parameters, domain axioms, or invented entities such as new particles or forces.

pith-pipeline@v0.9.0 · 5544 in / 1169 out tokens · 59525 ms · 2026-05-10T12:22:08.384068+00:00 · methodology


Reference graph

Works this paper leans on

113 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Android contributors. 2026. Open files using the Storage Access Framework | App data and files | Android Developers. Retrieved March 31, 2026 from https://developer.android.com/guide/topics/providers/document-provider

  2. [2]

    Android contributors. 2026. Scoped storage | Android Open Source Project. Retrieved March 31, 2026 from https://source.android.com/docs/core/storage/scoped

  3. [3]

    Anthropic. 2026. Building a C compiler with a team of parallel Claudes. Retrieved March 31, 2026 from https://www.anthropic.com/engineering/building-c-compiler

  4. [4]

    Anthropic PBC. 2026. Claude Code overview. Retrieved March 27, 2026 from https://code.claude.com/docs/en/overview

  5. [5]

    Anthropic PBC. 2026. Configure permissions - Claude Code Docs. Retrieved March 27, 2026 from https://code.claude.com/docs/en/permissions#read-and-edit

  6. [6]

    Anthropic PBC. 2026. Development containers - Claude Code Docs. Retrieved March 31, 2026 from https://code.claude.com/docs/en/devcontainer

  7. [7]

    Anthropic PBC. 2026. Hooks reference - Claude Code Docs. Retrieved April 01, 2026 from https://code.claude.com/docs/en/hooks

  8. [8]

    Anthropic PBC. 2026. Introducing Claude Opus 4.6. Retrieved April 01, 2026 from https://www.anthropic.com/news/claude-opus-4-6

  9. [9]

    Antirez. 2026. Don’t fall into the anti-AI hype. Retrieved March 31, 2026 from https://antirez.com/news/158

  10. [10]

    Anysphere, Inc. 2026. Cursor: The best way to code with AI. Retrieved March 27, 2026 from https://cursor.com/

  11. [11]

    Anysphere, Inc. 2026. Ignore File | Cursor Docs. Retrieved March 27, 2026 from https://cursor.com/docs/reference/ignore-file

  12. [12]

    Anysphere, Inc. 2026. Terminal | Cursor Docs. Retrieved April 01, 2026 from https://cursor.com/docs/agent/tools/terminal

  13. [13]

    AppArmor contributors. 2026. AppArmor. Retrieved March 31, 2026 from https://apparmor.net/

  14. [14]

    Matt Bishop. 2019. Computer Security: Art and Science (2 ed.). Addison-Wesley Educational, Boston, MA

  15. [15]

    Maximilian Blochberger, Jakob Rieck, Christian Burkert, Tobias Mueller, and Hannes Federrath. 2019. State of the sandbox: Investigating macOS application security. In Proceedings of the 18th ACM Workshop on Privacy in the Electronic Society. 150–161

  16. [16]

    Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. 2003. The zettabyte file system. In Proc. of the 2nd Usenix Conference on File and Storage Technologies, Vol. 215. 1

  17. [17]

    Harold Booth. 2026. National Vulnerability Database. National Institute of Standards and Technology. Retrieved March 28, 2026 from https://nvd.nist.gov/

  18. [18]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901

  19. [19]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. [n. d.]. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  20. [20]

    Cline. 2026. Cline Documentation. https://docs.cline.bot/home. AI coding agent for editor and terminal workflows. Accessed: 2026-04-01

  21. [21]

    Andrea Continella, Alessandro Guagnelli, Giovanni Zingaro, Giulio De Pasquale, Alessandro Barenghi, Stefano Zanero, and Federico Maggi. 2017. ShieldFS: The last word in ransomware resilient filesystems. In Black Hat Europe 2017

  22. [22]

    Microsoft Corporation. 2026. Copilot on Windows: Your Built-In AI Assistant. Retrieved March 27, 2026 from https://www.microsoft.com/en-us/windows/windows-11?wincampaign=copilot

  23. [23]

    Cybersecurity and Infrastructure Security Agency and Federal Bureau of Investigation. 2025. Cybersecurity Advisory AA23-320A: Scattered Spider. Cybersecurity Advisory. U.S. Department of Homeland Security. Retrieved March 28, 2026 from https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-320a

  24. [24]

    Daytona Platforms Inc. 2026. Daytona - Secure Infrastructure for Running AI-Generated Code. Retrieved March 31, 2026 from https://www.daytona.io/

  25. [25]

    DeepSeek-AI et al. 2025. DeepSeek-V3 technical report. arXiv:2412.19437

  26. [26]

    Xianzhong Ding, Le Chen, Murali Emani, Chunhua Liao, Pei-Hung Lin, Tristan Vanderbruggen, Zhen Xie, Alberto Cerpa, and Wan Du. 2023. HPC-GPT: Integrating Large Language Model for High-Performance Computing. In Proceedings of the SC ’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W ’23). Ass...

  27. [27]

    Docker Inc. 2026. Docker Sandboxes | Docker Docs. Retrieved March 31, 2026 from https://docs.docker.com/ai/sandboxes/

  28. [28]

    Docker Inc. 2026. OverlayFS storage driver | Docker Docs. Retrieved March 31, 2026 from https://docs.docker.com/engine/storage/drivers/overlayfs-driver/

  29. [29]

    Edera, Inc. 2026. Edera | Meet Hardened Runtime. Retrieved March 31, 2026 from https://edera.dev/

  30. [30]

    Csaba Fitzl and Wojciech Reguła. 2022. Knockout win against TCC, a.k.a. 20+ NEW ways to bypass your macOS privacy mechanisms. Presentation at Black Hat Europe 2022. Retrieved March 28, 2026 from https://i.blackhat.com/EU-22/Thursday-Briefings/EU-22-Fitzl-Knockout-Win-Against-TCC.pdf

  31. [31]

    FoundryLabs, Inc. 2026. E2B | The Enterprise AI Agent Cloud. Retrieved March 31, 2026 from https://e2b.dev/

  32. [32]

    Geminicli. 2026. Sandboxing in the Gemini CLI | Gemini CLI. Retrieved March 31, 2026 from https://geminicli.com/docs/cli/sandbox/#configuration

  33. [33]

    Git contributors. 2026. Git. Retrieved March 31, 2026 from https://git-scm.com/

  34. [34]

    Git-LFS contributors. 2026. Git Large File Storage. Retrieved March 31, 2026 from https://git-lfs.com/

  35. [35]

    Amir Goldstein. 2024. [PATCH 0/4] Stash overlay real upper file in backing_file. Retrieved March 30, 2026 from https://lore.kernel.org/all/20241004102342.179434-1-amir73il@gmail.com/

  36. [36]

    Andrew C. Goldstein. 1975. Files-11 on-disk structure specification. Technical Report. Digital Equipment Corporation, Maynard, MA, USA. https://bitsavers.org/pdf/dec/pdp11/rsx11m_s/Files-11_ODS-1_Spec_Jun75.pdf

  37. [37]

    Google LLC. 2026. Build, debug & deploy with AI: Gemini CLI. Retrieved March 27, 2026 from https://geminicli.com/

  38. [38]

    Google LLC. 2026. Gemini CLI. https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli. Open-source AI agent for the terminal. Accessed: 2026-04-01

  39. [39]

    Ely Greenfield. 2025. Our vision for accelerating creativity and productivity with agentic AI | Adobe Blog. Retrieved April 01, 2026 from https://blog.adobe.com/en/publish/2025/04/09/our-vision-for-accelerating-creativity-productivity-with-agentic-ai

  40. [40]

    Richard G Guy, John S Heidemann, Wai-Kei Mak, Thomas W Page Jr, Gerald J Popek, and Dieter Rothmeier. 1990. Implementation of the Ficus Replicated File System. In USENIX Summer, Vol. 90. 63–71

  41. [41]

    Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. 2025. The emerged security and privacy of llm agent: A survey with case studies. Comput. Surveys 58, 6 (2025), 1–36

  42. [42]

    John Heidemann and Gerald Popek. 1995. Performance of cache coherence in stackable filing. In Proceedings of the fifteenth ACM symposium on Operating systems principles. 127–141

  43. [43]

    Dave Hitz, James Lau, and Michael A Malcolm. 1994. File system design for an NFS file server appliance. In USENIX Winter 1994 Technical Conference Proceedings. USENIX Association, San Francisco, CA

  44. [44]

    Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024. InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. In Proceedings of the 41st International Conference on Machine Learning. 19544–19572

  45. [45]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 2 (2025), 1–55

  46. [46]

    IBM. 2026. IBM DevOps Code ClearCase. Retrieved March 28, 2026 from https://www.ibm.com/products/devops-code-clearcase

  47. [47]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: can language models resolve real-world Github issues? In The Twelfth International Conference on Learning Representations

  48. [48]

    Andrej Karpathy. 2026. It is hard to communicate how much programming has changed... https://x.com/karpathy/status/2026731645169185220

  49. [49]

    Ryusuke Konishi, Yoshiji Amagai, Koji Sato, Hisashi Hifumi, Seiji Kihara, and Satoshi Moriai. 2006. The Linux implementation of a log-structured file system. ACM SIGOPS Operating Systems Review 40, 3 (2006), 102–107

  50. [50]

    Alexander Larsson. 2026. GitHub - containers/bubblewrap: Low-level unprivileged sandboxing tool used by Flatpak and similar projects. Retrieved March 28, 2026 from https://github.com/containers/bubblewrap

  51. [51]

    Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu. 2024. Personal LLM agents: insights and survey about the cap...

  52. [52]

    Linux kernel contributors. 2023. Seccomp BPF (SECure COMPuting with filters) — The Linux Kernel documentation. Retrieved March 28, 2026 from https://docs.kernel.org/userspace-api/seccomp_filter.html

  53. [53]

    Linux kernel contributors. 2026. Overlay Filesystem — The Linux Kernel documentation. Retrieved March 31, 2026 from https://docs.kernel.org/filesystems/overlayfs.html

  54. [54]

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2024. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Lear...

  55. [55]

    Peter Loscocco. 2001. Integrating flexible support for security policies into the Linux operating system. In Proceedings of the FREENIX Track: USENIX Annual Technical Conference

  56. [56]

    Dirk Merkel. 2014. Docker: lightweight linux containers for consistent development and deployment. Linux Journal 239, 2 (2014), 2

  57. [57]

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. 2026. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868 (2026)

  58. [58]

    Meta Platforms, Inc. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Retrieved April 01, 2026 from https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  59. [59]

    Microsoft Corporation. 2026. GitHub Copilot in VS Code. Retrieved March 27, 2026 from https://code.visualstudio.com/docs/copilot/overview

  60. [60]

    Eduardo Mosqueira-Rey, Elena Hernández-Pereira, David Alonso-Ríos, José Bobes-Bascarán, and Ángel Fernández-Leal. 2023. Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review 56, 4 (2023), 3005–3054

  61. [61]

    Yohei Nakajima. 2023. GitHub - yoheinakajima/babyagi_archive. Retrieved March 27, 2026 from https://github.com/yoheinakajima/babyagi_archive

  62. [62]

    Netapp, Inc. 2026. SnapRestore. Retrieved April 01, 2026 from https://docs.netapp.com/us-en/ontap-apps-dbs/oracle/oracle-dp-snaprestore.html

  63. [63]

    Steve Newman. 2026. 45 Thoughts About Agents. https://secondthoughts.ai/p/45-thoughts-about-agents

  64. [64]

    OpenAI. 2026. Cloud environments – Codex web | OpenAI Developers. Retrieved March 31, 2026 from https://developers.openai.com/codex/cloud/environments

  65. [65]

    OpenAI. 2026. Codex CLI. Retrieved March 27, 2026 from https://developers.openai.com/codex/cli

  66. [66]

    OpenAI. 2026. Introducing GPT-5.4. Retrieved April 01, 2026 from https://openai.com/index/introducing-gpt-5-4/

  67. [67]

    OpenAI. 2026. OpenAI to acquire Astral. Retrieved March 27, 2026 from https://openai.com/index/openai-to-acquire-astral/

  68. [68]

    OpenAI. 2026. Sandboxing – Codex | OpenAI Developers. Retrieved March 28, 2026 from https://developers.openai.com/codex/concepts/sandboxing

  69. [69]

    OpenCode. 2026. OpenCode. https://opencode.ai/docs/. Open-source AI coding agent for terminal, desktop, and IDE workflows. Accessed: 2026-04-01

  70. [70]

    Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, and ...

  71. [71]

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. 2025. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning

  72. [72]

    R. Hugo Patterson and Stephen Manley. 2002. SnapMirror: File-system-based asynchronous mirroring for disaster recovery. In Conference on File and Storage Technologies (FAST 02). USENIX Association, Monterey, CA. https://www.usenix.org/conference/fast-02/snapmirror-file-system-based-asynchronous-mirroring-disaster-recovery

  73. [73]

    Grant A Pignatiello, Richard J Martin, and Ronald L Hickman Jr. 2020. Decision fatigue: A conceptual analysis. Journal of health psychology 25, 1 (2020), 123–135

  75. [75]

    David Quigley, Josef Sipek, Charles P Wright, and Erez Zadok. 2006. Unionfs: User-and community-oriented development of a unification filesystem. In Proceedings of the 2006 Linux Symposium, Vol. 2. 349–362

  76. [76]

    Sean Quinlan and Sean Dorward. 2002. Venti: A new approach to archival data storage. In Conference on File and Storage Technologies (FAST 02). USENIX Association, Monterey, CA. https://www.usenix.org/conference/fast-02/venti-new-approach-archival-data-storage

  77. [77]

    Sean Quinlan, Jim McKie, and Russ Cox. 2003. Fossil, an archival file server. Technical Report. Plan 9. Retrieved March 27, 2026 from https://p9f.org/sys/doc/fossil.pdf

  78. [78]

    Relynt. 2026. Designing approvals that do not kill automation | Relynt Blog. Retrieved March 28, 2026 from https://www.relyntpolicy.com/blog/slack-approvals-human-in-the-loop

  79. [79]

    Dennis M. Ritchie and Ken Thompson. 1974. The UNIX time-sharing system. Commun. ACM 17, 7 (July 1974), 365–375. https://doi.org/10.1145/361011.361061

  80. [80]

    Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage 9, 3 (2013), 1–32

Showing first 80 references.