SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation
Pith reviewed 2026-06-28 05:06 UTC · model grok-4.3
The pith
SHIELDS multi-agent system remediates up to 73% of OS security scan findings using iterative LLM feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHIELDS uses large language models in a multi-agent setup to treat OS hardening as an iterative, feedback-driven process. Instead of fixed remediations, it continuously proposes fixes and refines them based on target system execution and validation scans. Across evaluations, it successfully remediates up to 73% of scan findings, with success depending less on model size than on effective tool use and information gathering.
What carries the argument
An iterative multi-agent remediation loop in which LLMs propose, execute, and validate fixes using feedback from the target system and validation scans.
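As a rough illustration of that loop, the Python sketch below treats the proposer, executor, and scanner as plain callables. Every name in it (`remediate`, `propose`, `execute`, `scan`, `max_rounds`, and the toy stand-ins) is hypothetical and not taken from the paper; the stubs exist only so the sketch runs end to end.

```python
from typing import Callable, List, Set, Tuple


def remediate(
    finding: str,
    propose: Callable[[str, List[str]], str],
    execute: Callable[[str], str],
    scan: Callable[[], Set[str]],
    max_rounds: int = 3,
) -> Tuple[bool, List[str]]:
    """Propose, run, and validate fixes for one finding until the scan clears it."""
    feedback: List[str] = []  # execution output and scan status from earlier rounds
    for _ in range(max_rounds):
        command = propose(finding, feedback)  # an LLM call in a real system (assumed)
        output = execute(command)             # run on the target system
        if finding not in scan():             # validation scan no longer reports it
            return True, feedback
        feedback.append(f"{command!r} -> {output!r}; finding still open")
    return False, feedback


# Toy stand-ins: the first proposal fails, the second one clears the finding.
open_findings = {"sshd_permit_root_login"}


def toy_propose(finding: str, feedback: List[str]) -> str:
    return "wrong-fix" if not feedback else "right-fix"


def toy_execute(command: str) -> str:
    if command == "right-fix":
        open_findings.discard("sshd_permit_root_login")
        return "ok"
    return "no change"


fixed, history = remediate(
    "sshd_permit_root_login", toy_propose, toy_execute, lambda: set(open_findings)
)
```

Passing the accumulated feedback back into `propose` is what makes the loop iterative rather than a retry of the same fixed remediation.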
If this is right
- Automates compliance tasks that currently require manual effort or static tools.
- Allows effective use of smaller LLMs in security-sensitive environments.
- Supports local model deployment where privacy or compute limits apply.
- Reduces the burden of maintaining OS compliance with standards like STIGs.
Where Pith is reading between the lines
- Similar iterative approaches could extend to other security domains beyond OS hardening.
- If the feedback loop proves reliable, it might minimize the need for human oversight in remediation.
- Testing on real-world production systems rather than VMs could reveal additional challenges.
- The method might integrate with existing compliance tools to enhance their capabilities.
Load-bearing premise
That the iterative feedback from system execution and scans is sufficient for LLMs to produce correct fixes without introducing new vulnerabilities or needing human intervention.
What would settle it
An experiment showing that after SHIELDS remediation, a validation scan reports new or additional findings not present before, or that fixes cause system instability.
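One way to make the "new or additional findings" part of this check concrete is a set comparison of scan results before and after remediation. The sketch below is a minimal illustration in Python; the function name `compare_scans` and the rule identifiers are made up, and it assumes each scan can be reduced to a set of failing rule IDs. It does not cover the system-instability part of the test.

```python
from typing import Dict, Set


def compare_scans(before: Set[str], after: Set[str]) -> Dict[str, Set[str]]:
    """Split findings into remediated, still open, and newly introduced."""
    return {
        "remediated": before - after,  # reported before, gone afterwards
        "still_open": before & after,  # reported in both scans
        "new": after - before,         # only reported after remediation
    }


# Hypothetical rule IDs, for illustration only.
before = {"rule_a", "rule_b", "rule_c"}
after = {"rule_c", "rule_d"}

result = compare_scans(before, after)
regressed = bool(result["new"])  # True when remediation introduced findings
```

A non-empty `new` set is the outcome that would count against the premise above.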
Original abstract
Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SHIELDS, a multi-agent LLM system for OS hardening that treats compliance as an iterative, feedback-driven process: agents propose fixes, execute them on target VMs, and refine based on execution outcomes and validation scans against standards such as DISA STIGs. Across six LLMs (20B–400B parameters) and multiple VM configurations, the system is reported to remediate up to 73% of scan findings, with the central empirical claim that success depends more on effective tool use and information gathering than on model parameter count.
Significance. If the empirical results prove robust under controlled conditions, the work offers a practical route to reducing manual effort in security compliance, particularly for local or privacy-sensitive deployments where smaller models are preferred. The finding that tool-use effectiveness outweighs scale would be a useful contribution to the design of agentic security tools.
major comments (2)
- [Evaluation / Results] The abstract and evaluation description report a 73% remediation rate but supply no information on the number of trials per configuration, statistical error bars or variance across runs, baseline comparisons against static remediation scripts or existing compliance tools, or the protocol for handling post-hoc selection of successful runs. These omissions are load-bearing for the central claim that SHIELDS achieves reliable remediation.
- [Methodology / Evaluation] The evaluation protocol relies on the assumption that iterative feedback from target-system execution and validation scans is sufficient for LLMs to produce correct, non-regressive fixes without introducing new vulnerabilities. No description is given of safeguards, rollback mechanisms, or systematic failure-mode analysis that would substantiate this assumption.
minor comments (2)
- [Abstract] The abstract states that success 'depends less on model size than on effective tool use' but does not quantify this comparison (e.g., via ablation on tool availability or information-gathering steps).
- [System Design] Notation for agent roles, tool interfaces, and scan-result representations is introduced without a consolidated table or diagram, making it harder to follow the multi-agent workflow.
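The reporting asked for in the first major comment (trial counts, variance, error bars) could look roughly like the Python sketch below. The per-trial rates are invented for illustration and are not results from the paper; `summarize_rates` is a hypothetical helper built on the standard-library `statistics` module.

```python
import math
import statistics
from typing import List, Tuple


def summarize_rates(rates: List[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error) of per-trial rates."""
    mean = statistics.mean(rates)
    stdev = statistics.stdev(rates)       # sample stdev; needs at least two trials
    sem = stdev / math.sqrt(len(rates))   # standard error of the mean
    return mean, stdev, sem


# Made-up remediation rates for one model and one VM configuration.
trial_rates = [0.70, 0.65, 0.73, 0.68]

mean, stdev, sem = summarize_rates(trial_rates)
print(f"mean={mean:.3f} stdev={stdev:.3f} sem={sem:.3f} n={len(trial_rates)}")
```

Reporting the mean with its spread over all runs, rather than the best run, is also what rules out post-hoc selection of successful trials.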
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and methodology. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and safeguards.
Point-by-point responses
- Referee: [Evaluation / Results] The abstract and evaluation description report a 73% remediation rate but supply no information on the number of trials per configuration, statistical error bars or variance across runs, baseline comparisons against static remediation scripts or existing compliance tools, or the protocol for handling post-hoc selection of successful runs. These omissions are load-bearing for the central claim that SHIELDS achieves reliable remediation.
  Authors: We agree that these details are essential to support the central claims. In the revised manuscript we will expand the evaluation section to report the number of trials per configuration, include statistical measures such as variance and error bars across runs, add baseline comparisons to static remediation scripts and existing compliance tools, and explicitly describe the result-reporting protocol (with no post-hoc selection of runs).
  Revision: yes
- Referee: [Methodology / Evaluation] The evaluation protocol relies on the assumption that iterative feedback from target-system execution and validation scans is sufficient for LLMs to produce correct, non-regressive fixes without introducing new vulnerabilities. No description is given of safeguards, rollback mechanisms, or systematic failure-mode analysis that would substantiate this assumption.
  Authors: We acknowledge the need to substantiate this assumption. The revised manuscript will add a subsection detailing the safeguards employed (including rollback via VM snapshots), post-fix validation scans to detect regressions, and a systematic failure-mode analysis of observed errors and non-regressive outcomes from the experiments.
  Revision: yes
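The safeguard named in the second response (rollback via VM snapshots combined with a post-fix validation scan) can be sketched as a small wrapper. This is an assumption-laden illustration in Python, not the authors' implementation: `apply_with_rollback` and the in-memory "VM" are hypothetical, and a real system would call a hypervisor's snapshot interface instead.

```python
from typing import Callable, Dict, Set


def apply_with_rollback(
    apply_fix: Callable[[], None],
    scan: Callable[[], Set[str]],
    take_snapshot: Callable[[], str],
    restore_snapshot: Callable[[str], None],
) -> bool:
    """Apply a fix and keep it only if the follow-up scan shows no new findings."""
    before = scan()
    snapshot_id = take_snapshot()
    apply_fix()
    new_findings = scan() - before
    if new_findings:
        restore_snapshot(snapshot_id)  # regression detected: undo the change
        return False
    return True


# Toy in-memory "VM": its state is just the set of open findings.
state: Set[str] = {"rule_a"}
snapshots: Dict[str, Set[str]] = {}


def take() -> str:
    snapshots["s1"] = set(state)
    return "s1"


def restore(snapshot_id: str) -> None:
    state.clear()
    state.update(snapshots[snapshot_id])


def bad_fix() -> None:
    state.discard("rule_a")
    state.add("rule_z")  # closes one finding but opens another


kept = apply_with_rollback(bad_fix, lambda: set(state), take, restore)
```

Here the fix is rejected and the state returns to the snapshot, because the post-fix scan reports a finding that was not present beforehand.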
Circularity Check
No significant circularity; empirical measurement only
Full rationale
The paper reports an empirical evaluation of the SHIELDS multi-agent system on virtual machines across six LLMs. The central result (up to 73% remediation of scan findings) is a direct experimental measurement, not a quantity derived from equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to the paper's own inputs. The work is self-contained against external benchmarks (DISA STIG scans on VMs) with no self-citation chains or ansatzes invoked for the outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs supplied with execution feedback and scanner results can iteratively produce correct OS configuration fixes.
Appendix A Agent Prompts
This section contains the prompts we use for our Remedy, Review, QA, and Triage agents in all experiments.
A.1 Remedy Agent
Remedy Agent System Prompt
You are an adaptive remediation agent on Rocky Linux / RHEL...