arxiv: 2605.07709 · v1 · submitted 2026-05-08 · 💻 cs.SE

SafeTune: Search-based Harmfulness Minimisation for Large Language Models

Giordano d'Aloisio , David Williams , Giusy Annunziata , Zhiwei Fei , Antinisca Di Marco , Federica Sarro

Pith reviewed 2026-05-11 02:32 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM safety, harmfulness minimisation, hyperparameter tuning, system prompt engineering, search-based optimization, response relevance, repetition parameter

The pith

SafeTune reduces harmful responses from Qwen3.5 0.8B while increasing relevance through search over hyperparameters and prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SafeTune, a method that searches across model hyperparameters and system prompts to make LLM outputs less harmful and more relevant to the original prompt. It begins by measuring the harmfulness of responses from four general-purpose LLMs and then tests the new approach on Qwen3.5 0.8B. The evaluation reports a lower rate of harmful responses and higher prompt-response relevance, both with large effect sizes. Among the parameters explored, encouraging greater repetition in responses is reported as the most impactful for both goals.

Core claim

SafeTune is a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance, both with large effect sizes. Among the parameters explored, encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.
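To make "large effect size" concrete for a difference between two harmful-response rates, one common choice is Cohen's h. The sketch below is illustrative only: the summary here does not state which effect-size statistic the paper uses, and the function names and example rates are hypothetical.

```python
import math


def cohens_h(p_baseline: float, p_tuned: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions."""
    for p in (p_baseline, p_tuned):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")

    def phi(p: float) -> float:
        return 2.0 * math.asin(math.sqrt(p))

    return phi(p_baseline) - phi(p_tuned)


def magnitude(h: float) -> str:
    """Label |h| using Cohen's conventional 0.2 / 0.5 / 0.8 thresholds."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Hypothetical rates, NOT taken from the paper.
    h = cohens_h(p_baseline=0.6, p_tuned=0.2)
    print(round(h, 3), magnitude(h))
```

A positive h means the tuned configuration has the lower harmful-response rate. The thresholds are conventions, not tests of significance, so they would be reported alongside a statistical test rather than instead of one.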

What carries the argument

Multi-objective search over decoding hyperparameters (such as the repetition penalty) and over system prompts to jointly minimize harm and maximize relevance.
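A minimal sketch of what "jointly minimize harm and maximize relevance" means in practice: candidate configurations are compared by Pareto dominance, and only non-dominated ones are kept. This is plain random search with a non-dominated filter, not the paper's search algorithm. The parameter names follow the labels in the figures; the value ranges, the two system prompts, and the `evaluate` callback are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical system prompts; the paper's prompt pool is not given here.
PROMPTS = ["You are a helpful assistant.", "Stay on topic and decline unsafe requests."]


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration; all ranges are illustrative assumptions."""
    return {
        "repetition_penalty": rng.uniform(0.8, 1.5),
        "temperature": rng.uniform(0.1, 1.5),
        "top_p": rng.uniform(0.5, 1.0),
        "top_k": rng.randint(1, 100),
        "max_new_tokens": rng.choice([64, 128, 256]),
        "system_prompt": rng.choice(PROMPTS),
    }


@dataclass(frozen=True)
class Scored:
    config: dict
    harm_rate: float  # objective 1: lower is better
    relevance: float  # objective 2: higher is better


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.harm_rate <= b.harm_rate and a.relevance >= b.relevance
    better = a.harm_rate < b.harm_rate or a.relevance > b.relevance
    return no_worse and better


def pareto_front(cands: list) -> list:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def random_search(evaluate: Callable, n: int, seed: int = 0) -> list:
    """evaluate(config) must return (harm_rate, relevance); supplied by the caller."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n):
        cfg = sample_config(rng)
        harm, rel = evaluate(cfg)
        scored.append(Scored(cfg, harm, rel))
    return pareto_front(scored)
```

The result is a set of trade-offs rather than a single winner, so a final configuration still has to be chosen from the front by some stated rule.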

Load-bearing premise

That the multi-objective search on hyperparameters and prompts will reliably reduce harm and increase relevance across models and contexts without introducing new unintended behaviors or biases.

What would settle it

Applying SafeTune to other LLMs besides Qwen3.5 0.8B and measuring whether harmful response rates fall and relevance rises without new problems appearing.

Figures

Figures reproduced from arXiv: 2605.07709 by Antinisca Di Marco, David Williams, Federica Sarro, Giordano d'Aloisio, Giusy Annunziata, Zhiwei Fei.

**Figure 2.** RQ2: Qwen3.5 0.8B (baseline) vs SafeTune-optimised Qwen3.5 0.8B. Text recovered from the plot: panel (a) Harmfulness, listing Repetition Penalty, System Prompt, Top P, Top K, Max New Tokens, Temperature; panel (b) Relevance, listing Repetition Penalty, Top P, Top K, System Prompt, Temperature, Max New Tokens.

**Figure 3.** RQ3: Feature Importance Scores Results

Original abstract

The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SafeTune applies multi-objective search to tune prompts and hyperparameters for lower harm and higher relevance in LLMs, with a reported win on one small model via more repetition, but the harm metric itself is unspecified.

The key takeaway is that SafeTune frames harm reduction as a search problem over system prompts and a few hyperparameters, then claims large-effect-size gains on Qwen3.5 0.8B plus a side finding that forcing more repetition helps both objectives. They start with a quick survey of harm rates across four off-the-shelf models, which gives a modest baseline, and then treat the tuning as a straightforward multi-objective optimization task. That combination is not revolutionary, but it is a clean engineering move that practitioners could try without retraining.

The repetition result is the part that stands out; it suggests a cheap lever that might dilute risky content or simply change the output distribution in a way that scores better on whatever detector they used. If that holds, it is worth testing on other models.

The soft spot is the measurement of harm. The abstract gives no protocol for what counts as harmful, no benchmark name, no judge details, and no inter-rater numbers. Without that, it is impossible to tell whether the search is actually lowering risk or just learning to evade the scorer, especially since repetitive text can look less harmful to many classifiers by construction. They also only show results on a single 0.8B model, so generalization is still open.

This paper is aimed at engineers who deploy small LLMs and want lightweight safety knobs rather than full alignment work. Readers who already use search or prompt optimization will see the setup quickly and can judge whether to replicate it once the metric is spelled out.

I would send it to peer review. The idea is practical and the experimental skeleton is there; referees can ask for the missing evaluation details and tests on additional models without needing to reject the work outright.

Referee Report

3 major / 2 minor

Summary. The paper proposes SafeTune, a multi-objective search-based method that optimizes LLM hyperparameters and system prompts to reduce response harmfulness while increasing prompt-response relevance. It first surveys harmfulness in four general-purpose LLMs, then reports that SafeTune applied to Qwen3.5 0.8B yields statistically significant reductions in the harmful response rate and gains in relevance (both with large effect sizes), with greater repetition emerging as the most impactful parameter.

Significance. If the underlying harm and relevance metrics prove robust and generalizable, SafeTune would provide a lightweight, training-free alternative to conventional alignment methods for improving LLM safety. The emphasis on repetition as a high-impact lever and the multi-objective framing could usefully inform prompt-engineering practice, but the absence of concrete evaluation protocols currently prevents assessment of whether these gains reflect genuine safety improvements or metric-specific artifacts.

major comments (3)

[Evaluation section] Evaluation section (and abstract): The headline claim of large-effect-size reductions in harmful response rate for Qwen3.5 0.8B is presented without any description of the harmfulness detector (LLM judge, keyword list, benchmark, or human guidelines), the evaluation prompts, inter-rater reliability, or statistical tests. Because SafeTune searches over prompts and repetition parameters on the same distribution used for reporting, this omission directly undermines both the harm-reduction and relevance-increase results; the search could simply be producing outputs that evade the particular detector.
[Method and Results sections] Method and Results sections: The multi-objective search is described as jointly optimizing harm and relevance, yet no held-out test set, cross-validation procedure, or external baseline comparisons are mentioned. Without these, it is impossible to determine whether the reported gains generalize beyond the search distribution or simply reflect overfitting to the (unspecified) evaluation metric.
[Results section] Results section: The claim that 'encouraging greater repetition' is the most impactful parameter for simultaneously lowering harm and raising relevance requires explicit operationalization of repetition (e.g., n-gram overlap, sentence-level metrics) and evidence that the effect is not an artifact of how the harm classifier scores repetitive text. This is load-bearing for the parameter-sensitivity conclusion.
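The held-out evaluation requested in the major comments can be sketched as a deterministic split of the prompt pool: the search only ever sees one partition, and headline numbers are reported on the other. This is an illustrative assumption about how such a split could be built, not the paper's protocol; `split_prompts`, the salt, and the 20% fraction are hypothetical.

```python
import hashlib


def split_prompts(prompts, test_fraction=0.2, salt="heldout-demo"):
    """Assign each prompt to 'search' or 'held_out' by hashing its text.

    Hashing makes the assignment stable across runs and independent of
    list order, so a prompt can never drift between partitions.
    """
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must lie in [0, 1]")
    search, held_out = [], []
    for prompt in prompts:
        digest = hashlib.sha256((salt + prompt).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        (held_out if bucket < test_fraction else search).append(prompt)
    return search, held_out


def harm_rate(flags):
    """Fraction of responses flagged harmful; flags is a list of booleans."""
    if not flags:
        raise ValueError("no responses to score")
    return sum(1 for f in flags if f) / len(flags)


if __name__ == "__main__":
    pool = [f"prompt {i}" for i in range(100)]  # placeholder prompts
    search_set, test_set = split_prompts(pool)
    print(len(search_set), len(test_set))
```

A gap between the harm rate on the search partition and on the held-out partition would be one direct signal of overfitting to the search distribution.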

minor comments (2)

[Abstract and Introduction] The abstract and introduction should name the four LLMs surveyed and give the precise definition of 'prompt-response relevance' used as a search objective.
[Method section] Notation for the search objectives and the repetition parameter should be introduced at first use and applied consistently thereafter.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify areas where greater transparency and rigor are needed in the evaluation and methodological descriptions. We address each major comment below and will make corresponding revisions to the manuscript.

Referee: [Evaluation section] Evaluation section (and abstract): The headline claim of large-effect-size reductions in harmful response rate for Qwen3.5 0.8B is presented without any description of the harmfulness detector (LLM judge, keyword list, benchmark, or human guidelines), the evaluation prompts, inter-rater reliability, or statistical tests. Because SafeTune searches over prompts and repetition parameters on the same distribution used for reporting, this omission directly undermines both the harm-reduction and relevance-increase results; the search could simply be producing outputs that evade the particular detector.

Authors: We agree that the Evaluation section and abstract require substantially more detail on the harmfulness assessment procedure. In the revised manuscript we will add a complete specification of the detector (including its implementation, prompt template if applicable, and any validation steps), the exact evaluation prompts, inter-rater or reliability statistics, and the statistical tests with effect-size reporting. To address the risk that the search merely evades the detector, we will also report results on a held-out prompt set that was not used during the multi-objective search, thereby demonstrating that the observed reductions are not limited to the optimization distribution. revision: yes
Referee: [Method and Results sections] Method and Results sections: The multi-objective search is described as jointly optimizing harm and relevance, yet no held-out test set, cross-validation procedure, or external baseline comparisons are mentioned. Without these, it is impossible to determine whether the reported gains generalize beyond the search distribution or simply reflect overfitting to the (unspecified) evaluation metric.

Authors: The absence of an explicit held-out evaluation protocol is a genuine limitation in the current draft. We will revise the Method section to describe the prompt distribution, introduce a held-out test partition, and document the cross-validation or split procedure used for final reporting. We will also add comparisons against simple baselines (standard system prompts and single-objective tuning) to provide external context and help readers assess whether the gains exceed what would be expected from overfitting to the particular metric. revision: yes
Referee: [Results section] Results section: The claim that 'encouraging greater repetition' is the most impactful parameter for simultaneously lowering harm and raising relevance requires explicit operationalization of repetition (e.g., n-gram overlap, sentence-level metrics) and evidence that the effect is not an artifact of how the harm classifier scores repetitive text. This is load-bearing for the parameter-sensitivity conclusion.

Authors: We accept that the current treatment of repetition is insufficiently precise. In the revised Results section we will define repetition operationally (e.g., via n-gram overlap ratios and sentence-level repetition counts induced by the system prompt) and present the parameter-sensitivity analysis with these metrics. To rule out classifier artifacts we will add a supplementary check consisting of manual inspection of a random sample of responses together with an alternative harm metric; this will show whether the harm reduction persists when repetition is controlled for. revision: yes
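One way the n-gram overlap ratio mentioned in the last response could be operationalized is the share of n-gram occurrences that repeat an earlier n-gram in the same response. The definition below is an assumption chosen for illustration (whitespace tokenization, lowercasing, trigrams by default); the authors' eventual metric may differ.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-gram occurrences that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean the response
    is almost entirely repeated material.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()  # naive whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


if __name__ == "__main__":
    print(repeated_ngram_ratio("a b c a b c"))             # 0.25
    print(repeated_ngram_ratio("the cat sat on the mat"))  # 0.0
```

Reporting harm scores binned by this ratio would show whether the harm reduction persists among responses with low repetition, which is the control the referee asks for.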

Circularity Check

0 steps flagged

No significant circularity in SafeTune derivation

The paper describes an empirical search-based tuning procedure (hyperparameter and prompt optimization) followed by separate evaluation measurements on harmfulness and relevance. No equations, self-definitions, or fitted parameters are relabeled as independent predictions. No load-bearing self-citations or uniqueness theorems appear in the provided text. The central claims rest on reported experimental outcomes rather than reducing to inputs by construction. This is the expected non-circular outcome for an applied search/optimization paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard LLM evaluation practices.
