pith. machine review for the scientific record.

arxiv: 2604.24444 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 03:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords post-editing · LLM-generated text · personal style · stylistic similarity · human-AI writing · style metrics · collaborative writing

The pith

Post-editing LLM-generated drafts raises stylistic similarity to a writer's own unassisted text while lowering similarity to pure model output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports a pre-registered study in which 81 participants edited LLM drafts for tasks where personal style mattered. Embedding-based metrics showed that the edited versions moved closer to each participant's control writing and farther from untouched LLM text. Even after editing, however, the texts remained stylistically nearer to the original model output than to the participant's own natural writing and displayed lower stylistic diversity overall. Participants often judged the edited text as representative of their personal style despite these measurable model traces.

Core claim

In the study, post-editing increased stylistic similarity to participants' unassisted writing and reduced similarity to fully LLM-generated output, yet post-edited text remained closer in style to LLM text than to the unassisted control text and exhibited reduced stylistic diversity compared with unassisted human text.

What carries the argument

Embedding-based style similarity metrics that quantify how much post-edited text aligns with a participant's own unassisted writing versus untouched LLM output.
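To make that mechanism concrete, here is a minimal sketch of how an embedding-based style similarity score can be computed. The paper's exact embedding model and preprocessing are not specified on this page (the simulated rebuttal mentions Sentence-BERT); the model name, helper function, and example texts below are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of an embedding-based style similarity score (illustrative,
# not the paper's exact pipeline). Assumes the sentence-transformers library;
# the model name and the example texts below are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder general-purpose encoder

def style_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the embeddings of two texts."""
    emb_a, emb_b = model.encode([text_a, text_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# For one participant: compare the post-edited draft against their own
# unassisted (control) writing and against the untouched LLM draft.
post_edited_text = "..."   # participant's edited version of the LLM draft
unassisted_text = "..."    # the same participant's control writing
llm_draft = "..."          # untouched LLM-generated draft

sim_to_self = style_similarity(post_edited_text, unassisted_text)
sim_to_llm = style_similarity(post_edited_text, llm_draft)
print(f"similarity to own writing: {sim_to_self:.3f}, to LLM draft: {sim_to_llm:.3f}")
```

A dedicated style representation, rather than a general-purpose sentence encoder, would be the more faithful choice for isolating style from content; this page does not state which embedding model the authors actually used.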

If this is right

  • Post-editing shifts output toward a writer's style but does not fully eliminate LLM stylistic patterns.
  • Edited text shows lower stylistic diversity than writing produced without LLM assistance.
  • Writers may perceive post-edited text as authentic even when objective metrics detect model influence.
  • Pure generation and post-editing produce different degrees of stylistic personalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Tools that help users rewrite LLM drafts could focus on increasing output diversity to close the remaining gap.
  • For tasks where exact personal voice is required, users may still need to write from scratch rather than edit.
  • The observed perception gap suggests that current similarity metrics may not fully align with reader judgments of authenticity.

Load-bearing premise

Embedding-based metrics capture what counts as personal style in a way that matches human perception and practical importance.

What would settle it

A controlled experiment in which independent human readers rate post-edited text as stylistically indistinguishable from unassisted human writing at rates equal to or higher than LLM text would falsify the claim that detectable LLM traces remain.
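As a purely hypothetical illustration of how such an experiment could be scored (nothing below comes from the paper; all counts are invented placeholders), one could compare the rate at which readers judge post-edited text indistinguishable from unassisted writing against the corresponding rate for LLM text with a two-sample proportions test.

```python
# Hypothetical scoring of the settling experiment described above; all counts
# are invented placeholders, not data from the paper.
from statsmodels.stats.proportion import proportions_ztest

judged_indistinguishable = [140, 95]  # post-edited condition, LLM condition
judgments = [200, 200]                # reader judgments per condition

# One-sided test of whether the post-edited rate exceeds the LLM rate.
stat, p_value = proportions_ztest(judged_indistinguishable, judgments, alternative="larger")
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```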

Figures

Figures reproduced from arXiv: 2604.24444 by Calvin Bao, Connor Baumler, Hal Daumé III, Huy Nghiem, Marine Carpuat, Xinchen Yang.

Figure 1: Study overview: pre-survey, tutorial, two randomized task blocks (treatment and control) with task …
Figure 2: The main writing task. In both treatment and control blocks, participants plan details alone (top …
Figure 3: Similarity to LLM-generated text (left) and control text (right) before and after post-editing. Through post-editing, text is more similar to control text (H1a) and less similar to LLM-generated text (H1b). … own unassisted style than it does to other participants' style (H1a′, p = .0002, g = −0.56, 95% CI: [−0.7, −0.43]) …
Figure 5: Similarity of post-edited text to participants' …
Figure 6: Stylistic similarity between pairs of text from …
Figure 8: Relationship between participants' perception …
Figure 9: Relationship between participants' perception …
Figure 10: Comparison of participants' desires for vari…
Figure 12: Linguistic acceptability (model-judged) of …
Figure 11: Comparison of feature prioritization between …
Figure 13: Length and number of post-edited spans. We see that a) spans are skewed such that the average span is …
Figure 14: Number of participants selecting each candidate scenario as one where personal style is important. The …
Figure 15: Importance of personal style in selected tasks in a) the formative survey and b) the main study.
Figure 16: Pre-survey …
Figure 17: Task difficulty survey. An identical survey is provided after the treatment and control task blocks.
Figure 18: First half of the post-study survey. In the interface, both these questions and those in …
Figure 19: Second half of the post-study survey. In the interface, both these questions and those in …
Figure 20: Sample study task. In a) the participant has finished planning the details to include in their catch-up …
Figure 21: Post-task survey questions. In the control block, participants are not asked about the LLM-generated …
Original abstract

Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants' unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text still remains stylistically closer in style to LLM text than to participants' unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity, with post-edited text often perceived as representative of participants' personal style despite remaining detectable LLM stylistic traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reports results from a pre-registered online study (n=81) in which participants post-edit LLM-generated drafts on writing tasks where personal style is salient. Using standard sentence-embedding cosine similarities, the authors conclude that post-editing increases stylistic similarity to participants' unassisted writing and decreases similarity to fully LLM-generated text, yet post-edited output remains closer to LLM style than to human control text and shows lower stylistic diversity; they additionally document a gap between participants' perceptions of authenticity and the embedding-based measurements.

Significance. If the embedding metrics are shown to track human-perceived personal style, the work supplies useful empirical bounds on what post-editing can achieve in stylistically sensitive domains and highlights a persistent LLM trace that users may not notice. The pre-registered design with explicit human controls is a clear methodological strength that supports reproducibility and direct comparison.

major comments (1)
  1. [§5] §5 (Results) and the embedding analysis: the central quantitative claims—that post-edited text is still closer to LLM output than to unassisted human text and exhibits reduced diversity—rest entirely on cosine similarities in sentence embeddings. No correlation is reported between these distances and human pairwise style-similarity ratings collected on the same texts, leaving open the possibility that the metric is sensitive to topic, length, or fluency confounds rather than the intended personal stylistic features.
minor comments (2)
  1. [Abstract and §5] The abstract and §5 could report the exact statistical tests, effect sizes, and p-values supporting the similarity and diversity comparisons rather than qualitative descriptions alone.
  2. [Figures] Figure captions and axis labels for the embedding similarity plots should explicitly state the embedding model and preprocessing steps used to compute the reported cosines.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their constructive review and for recognizing the strengths of our pre-registered design with human controls. We address the single major comment below regarding validation of the embedding metrics.

point-by-point responses
  1. Referee: [§5] §5 (Results) and the embedding analysis: the central quantitative claims—that post-edited text is still closer to LLM output than to unassisted human text and exhibits reduced diversity—rest entirely on cosine similarities in sentence embeddings. No correlation is reported between these distances and human pairwise style-similarity ratings collected on the same texts, leaving open the possibility that the metric is sensitive to topic, length, or fluency confounds rather than the intended personal stylistic features.

    Authors: We appreciate this observation on metric validation. Our study collected participants' self-reported ratings of perceived stylistic authenticity for the post-edited outputs (as noted in the abstract and §5), but we did not collect pairwise human style-similarity judgments between texts. Consequently, we cannot compute or report a correlation between embedding distances and such pairwise ratings. We selected sentence embeddings (Sentence-BERT) as they are standard in the field for isolating stylistic features while being relatively robust to topical content; our task prompts were designed to hold topic roughly constant while emphasizing personal voice, and post-editing naturally affects fluency and length in ways that we report descriptively. We acknowledge that residual confounds remain possible without direct human correlation data. In revision we will expand the Discussion to (a) cite prior validation studies of embeddings for style, (b) explicitly note the absence of pairwise human ratings as a limitation, and (c) highlight the observed divergence between authenticity perceptions and embedding distances as evidence that the metric is capturing LLM traces not fully aligned with human self-perception. These additions will be made to §5 and the Discussion without new data collection.

    revision: partial

standing simulated objections not resolved
  • We cannot report a correlation between embedding distances and human pairwise style-similarity ratings because such pairwise ratings were not collected in the study.
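To make the unresolved objection concrete, here is a minimal sketch of the validation step the referee asks for, under the assumption that pairwise human style-similarity ratings existed for the same text pairs. All values and variable names are hypothetical, since the paper did not collect such ratings.

```python
# Hypothetical metric-validation check (not performed in the paper, per the
# rebuttal): correlate embedding similarities with human pairwise
# style-similarity ratings on the same text pairs. All values are invented.
import numpy as np
from scipy.stats import spearmanr

embedding_similarity = np.array([0.71, 0.55, 0.82, 0.40, 0.63])  # cosine similarity per text pair
human_rating = np.array([4, 3, 5, 2, 4])                          # e.g. 1-5 Likert similarity per pair

rho, p_value = spearmanr(embedding_similarity, human_rating)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A strong positive correlation would support the embedding metric as a proxy
# for perceived style; a weak one would back the referee's confound concern.
```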

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurement study

full rationale

The paper reports results from a pre-registered online study (n=81) that directly measures stylistic similarity via embedding-based metrics on post-edited, LLM-generated, and unassisted human texts. No derivations, equations, fitted parameters, or self-citation chains are invoked to support the central claims; the findings are straightforward empirical comparisons of cosine similarities and diversity statistics computed on the collected data. The analysis is grounded in an external benchmark (the participants' own texts) rather than reducing outputs to inputs by construction.
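The diversity comparison implies some summary statistic over each condition's texts, which this page does not name. One common, illustrative choice is mean pairwise cosine distance among a set of texts, sketched below with the same kind of placeholder sentence embeddings as the earlier similarity sketch; it is an assumption, not the authors' reported measure.

```python
# Illustrative stylistic diversity statistic: mean pairwise cosine distance
# among a set of texts (higher = more diverse). The paper's actual diversity
# measure is not stated on this page; texts below are placeholders.
from itertools import combinations
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def mean_pairwise_distance(texts: list[str]) -> float:
    """Average cosine distance over all pairs of texts in the collection."""
    embs = model.encode(texts)
    dists = []
    for i, j in combinations(range(len(texts)), 2):
        a, b = embs[i], embs[j]
        dists.append(1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(dists))

post_edited_texts = ["...", "...", "..."]  # placeholder corpus of post-edited texts
unassisted_texts = ["...", "...", "..."]   # placeholder corpus of unassisted texts
print(mean_pairwise_distance(post_edited_texts), mean_pairwise_distance(unassisted_texts))
```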

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that embedding-based metrics validly quantify personal style and that the online study tasks elicit representative personal writing.

axioms (2)
  • domain assumption Embedding-based similarity metrics capture meaningful aspects of personal writing style
    Used as the primary outcome measure without reported human validation of the metric against perceived authenticity.
  • domain assumption The writing tasks chosen elicit stable personal style that can be compared across conditions
    Required for the before/after and control comparisons to be meaningful.

pith-pipeline@v0.9.0 · 5494 in / 1242 out tokens · 39150 ms · 2026-05-08T03:49:13.274567+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    Agentic reproduction of an NLP study recovers original findings and demonstrates that GPT-5.5 and Claude Opus can reduce their AI-detection probability by shrinking detector margins through 20 feedback iterations.

Reference graph

Works this paper leans on

5 extracted references · 2 canonical work pages · cited by 1 Pith paper
