Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

Robert Dilworth

arXiv: 2604.10271 · v4 · pith:DPZUF7AHnew · submitted 2026-04-11 · 💻 cs.CR · cs.CL · cs.IR

Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3

classification 💻 cs.CR · cs.CL · cs.IR

keywords homoglyph substitution · stylometry · adversarial stylometry · privacy · Unicode · authorship attribution · forensic linguistics

The pith

Homoglyph substitution degrades stylometric systems by replacing characters with visually similar alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that swapping letters in text for look-alikes from other Unicode scripts can weaken stylometric tools that infer author traits such as age range or country from writing patterns. This matters because social-media posts already allow statistical recovery of personal details comparable to what leaks from ID documents, and simply avoiding text disclosure is impractical. The proposed method keeps the text readable while targeting the character features that stylometry depends on. A sympathetic reader would see this as a lightweight privacy tool for everyday online writing.

Core claim

Performing homoglyph substitution on text degrades stylometric systems, allowing authors to reduce the leakage of personal information such as estimated age and geographic location that these systems can otherwise extract from voluntary text disclosures.

What carries the argument

Homoglyph substitution: replacing characters with visually similar alternatives at different Unicode code points (for example, Latin 'h', U+0068, with Cyrillic 'һ', U+04BB). The substitution targets and disrupts the character-level patterns that stylometric classifiers use.
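
A minimal Python sketch of the substitution described above. The `HOMOGLYPHS` table and the `substitute` function are illustrative assumptions, not the paper's implementation; only the 'h' to U+04BB pair comes from the source, and the other pairs are common look-alikes added for the example.

```python
# Hypothetical Latin -> Cyrillic look-alike table (an assumption for this sketch).
# Only the 'h' -> U+04BB pair is taken from the paper's abstract.
HOMOGLYPHS = {
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
}


def substitute(text: str, table: dict[str, str] = HOMOGLYPHS) -> str:
    """Replace every character that has an entry in `table`; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


original = "the house"
altered = substitute(original)

# The two strings render almost identically but differ at the code-point level.
print([f"U+{ord(c):04X}" for c in original])
print([f"U+{ord(c):04X}" for c in altered])
```

The length of the text is unchanged; only the code points behind the glyphs differ, which is what a character-level feature extractor would see.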

If this is right

Stylometric authorship attribution and trait inference become measurably less reliable on the altered text.
Individuals can reduce the personal information extractable from their online writing while preserving visual readability.
Adversarial stylometry provides a practical defense against forensic analysis of voluntary text disclosures.
Text can be altered to hinder stylometric recovery of demographic signals such as age group or location.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stylometric tools may require explicit Unicode normalization steps to remain effective against this class of obfuscation.
An iterative arms race could develop between substitution techniques and improved detection or normalization methods.
The approach might generalize to other character-based privacy protections in digital communication.
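
The normalization countermeasure in the first extension can be sketched in Python. One caveat: the standard Unicode normalization forms are not designed to merge look-alikes across scripts, so NFKC leaves U+04BB unchanged, and a defender would need an explicit look-alike table. The `FOLD` table and `fold_homoglyphs` function are hypothetical; a production system would more plausibly draw on the Unicode confusables data than on a hand-written table.

```python
import unicodedata

# Hypothetical inverse table mapping Cyrillic look-alikes back to Latin letters.
FOLD = {"\u04bb": "h", "\u0430": "a", "\u0435": "e", "\u043e": "o"}


def fold_homoglyphs(text: str) -> str:
    """Apply NFKC (handles compatibility forms such as fullwidth letters),
    then fold the cross-script look-alikes listed in FOLD."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(FOLD.get(ch, ch) for ch in normalized)


altered = "t\u04bb\u0435 \u04bb\u043eus\u0435"  # looks like "the house"

nfkc_only = unicodedata.normalize("NFKC", altered)
folded = fold_homoglyphs(altered)

print(nfkc_only == altered)  # NFKC alone does not undo the substitution
print(folded)
```

The design point is that the defender's table must cover every substitute the obfuscator might use, which is where the arms race mentioned above would play out.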

Load-bearing premise

Stylometric systems depend on character-level or Unicode-sensitive features that homoglyph substitution will reliably disrupt without being normalized away by standard preprocessing or creating new detectable signals.
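
To make the premise concrete, here is a rough Python sketch of one character-level feature a stylometric system might use: a character-bigram frequency profile, compared before and after substitution with cosine distance. The feature choice, the helper names, and the sample sentence are assumptions for illustration; the paper's actual feature set is not described in the material above.

```python
import math
from collections import Counter


def bigram_profile(text: str) -> Counter:
    """Count overlapping character bigrams in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


original = "she had hoped he would show them the house"
altered = original.replace("h", "\u04bb")  # Latin h -> Cyrillic shha

d_same = cosine_distance(bigram_profile(original), bigram_profile(original))
d_alt = cosine_distance(bigram_profile(original), bigram_profile(altered))

print(f"distance to itself:        {d_same:.3f}")
print(f"distance after substitution: {d_alt:.3f}")
```

Every bigram containing 'h' is replaced by a bigram the original profile has never seen, so the distance grows. Whether a real classifier degrades depends on its features and on any preprocessing applied first.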

What would settle it

An experiment in which stylometric accuracy on the modified text remains statistically unchanged from the original, or in which routine Unicode normalization restores full performance.

Figures

Figures reproduced from arXiv: 2604.10271 by Robert Dilworth.

**Figure 2.** Kagi Translate acts as a conduit for adversarial stylometry, rending au…

**Figure 3.** A Taxonomic Overview of the Adversarial Attacks: …

**Figure 4.** TraceTarnish: Our stylometric attack script–a gestalt modular framework where each component contributes to a whole that is greater than the sum of its parts; incorporating homoglyph functionality resulted in the following processing pipeline for razing authorship: Translation → Obfuscation → Imitation → Injection

**Figure 5.** An enumeration of the adversarial attacks examined by …

**Figure 6.** The sentences to be evaluated, representing 100% Injection for each ex…

**Figure 7.** A plot capturing the results of the homoglyph-based Injection-optimality …

**Figure 8.** TraceTarnish, in its current state, implements an Injection amalgam, interspersing both homoglyphs and zero-width characters into text to shroud authorship. To demonstrate the efficiency of the Injection component, we rerun Experiment #1, incrementally introducing both homoglyphs and zero-width characters in a stepwise fashion. The following string represents 100% Injection, with the “bad characters” h…

**Figure 9.** Distance measures in stylometry are mathematical methods used to quan…
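
The Figure 8 caption describes introducing homoglyphs and zero-width characters stepwise, up to 100% Injection. A hedged Python sketch of what a rate-controlled injection could look like follows. The `inject` function, its parameters, the mapping table, and the choice of U+200B as the zero-width character are all assumptions for illustration, not TraceTarnish's actual code.

```python
import random

# Hypothetical look-alike table and zero-width character (assumptions).
HOMOGLYPHS = {"h": "\u04bb", "a": "\u0430", "e": "\u0435", "o": "\u043e"}
ZERO_WIDTH = "\u200b"  # ZERO WIDTH SPACE


def inject(text: str, rate: float, seed: int = 0) -> str:
    """Substitute a fraction `rate` of the eligible characters and place a
    zero-width character after each substituted one."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so runs are repeatable
    eligible = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    chosen = set(rng.sample(eligible, round(rate * len(eligible))))
    out = []
    for i, ch in enumerate(text):
        if i in chosen:
            out.append(HOMOGLYPHS[ch])
            out.append(ZERO_WIDTH)
        else:
            out.append(ch)
    return "".join(out)


sample = "the human hand"
for rate in (0.0, 0.5, 1.0):
    # ascii() makes the invisible and look-alike characters visible as escapes.
    print(rate, ascii(inject(sample, rate)))
```

Sweeping `rate` in small steps and scoring each output with a stylometric tool is one way an injection-optimality curve like the one in Figure 7 could be produced.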

Original abstract

In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" [U+0068] → "һ" [U+04BB])--on text can degrade stylometric systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Homoglyph substitution for stylometry defense is an obvious idea that needs testing to be useful.

The main takeaway is that this paper suggests homoglyph substitution as a simple way to degrade stylometric analysis of text for privacy, but it offers no experiments or data to show whether the idea actually works in practice. The core proposal is to replace characters with visually similar ones from other scripts, like swapping a Latin 'h' for a Cyrillic equivalent, to break author-style signals in writing. This is framed around real privacy risks, such as stylometry pulling age or location estimates from social media posts, which the paper contrasts with more obvious ID leaks. That motivation section is clear and direct.

The new angle is applying the known homoglyph trick specifically to adversarial stylometry, rather than just security contexts like phishing. It does a reasonable job explaining the substitution process without overcomplicating it.

The big gap is the lack of any evaluation. No datasets, no stylometric baselines, no accuracy drops measured before and after substitution. The claim depends on stylometric systems relying on raw Unicode codepoints that don't get normalized or script-detected in preprocessing, which is a shaky assumption. Mixed-script text could easily create its own detectable patterns instead. Without code or results, it's impossible to tell if this holds up or just adds noise that gets filtered.

This is aimed at researchers working on text privacy or authorship obfuscation techniques. Someone building quick prototypes might pick up the concept as a starting point, but it won't give them validated methods to use. I would send it for peer review. Referees can flag the missing tests and point to standard stylometry pipelines that might neutralize this, which would help turn the idea into something more solid.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes homoglyphic substitution, the replacement of Latin characters with visually similar glyphs from other Unicode blocks (e.g., U+0068 'h' to U+04BB 'һ'), as an adversarial stylometry technique for degrading the author attribution and profiling performance of stylometric systems.

Significance. If empirically validated, the approach could supply a lightweight, accessible method for textual privacy protection against stylometric inference of attributes such as age or location. It extends prior adversarial stylometry work but currently offers only a descriptive claim without demonstrated effectiveness or robustness.

major comments (2)

Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
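
The "new detectable signals" concern in the first comment can be illustrated with a small Python sketch that estimates the share of non-Latin letters in a text and flags words that mix scripts. It uses the first word of each character's Unicode name as a crude script label, an approximation chosen for brevity because Python's standard library does not expose the Unicode Script property directly. The function names are hypothetical.

```python
import unicodedata


def script_of(ch: str):
    """Crude script label: first word of the Unicode character name,
    e.g. 'LATIN' or 'CYRILLIC'. Returns None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def non_latin_ratio(text: str) -> float:
    """Fraction of letters whose script label is not LATIN."""
    scripts = [s for s in map(script_of, text) if s is not None]
    if not scripts:
        return 0.0
    return sum(s != "LATIN" for s in scripts) / len(scripts)


def mixed_script_words(text: str) -> list[str]:
    """Whitespace-separated words containing letters from more than one script."""
    return [
        w for w in text.split()
        if len({s for s in map(script_of, w) if s is not None}) > 1
    ]


clean = "the house"
altered = "t\u04bbe \u04bbouse"  # Latin h replaced by Cyrillic shha

print(non_latin_ratio(clean), mixed_script_words(clean))
print(non_latin_ratio(altered), [ascii(w) for w in mixed_script_words(altered)])
```

A check this simple would flag substituted English text immediately, which is why the referee asks for evidence that the method does not trade one signal for another.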

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important gaps in empirical support and robustness analysis. We agree that the current manuscript is primarily conceptual and will incorporate experiments, implementation details, and preprocessing evaluations in the revised version to strengthen the claims.

Referee: Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.

Authors: We acknowledge that the present version offers a descriptive proposal without quantitative validation. The manuscript introduces homoglyphic substitution as an adversarial stylometry technique but does not report experiments, datasets, or metrics. In revision, we will add empirical evaluations on standard stylometric corpora (e.g., using author attribution accuracy and attribute inference F1 scores), detail the substitution algorithm and parameters, and explicitly test for introduced signals such as non-Latin character frequency distributions to demonstrate that the method does not create easily detectable artifacts. revision: yes

Referee: Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.

Authors: This is a fair and substantive critique. The current text does not examine how homoglyph substitution interacts with common text normalization pipelines. We will revise the manuscript to include a dedicated analysis section that evaluates survival rates under NFKC/NFD normalization, script detection heuristics, and various tokenizers (e.g., word-level, subword, and Unicode-aware). Where the substitution is neutralized, we will discuss mitigation strategies or clearly delineate the threat model under which the technique remains effective. revision: yes
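
The "survival rate" evaluation mentioned in this response could be measured roughly as the fraction of substitute code points that remain unchanged under a given normalization form. The Python sketch below is one way to do that; `survival_rate` and both candidate lists are assumptions for illustration, and the fullwidth letters are included only as contrast cases that NFKC folds back to ASCII.

```python
import unicodedata


def survival_rate(substitutes: list[str], form: str) -> float:
    """Fraction of characters left unchanged by unicodedata.normalize(form, ch)."""
    survived = sum(unicodedata.normalize(form, ch) == ch for ch in substitutes)
    return survived / len(substitutes)


# Cross-script look-alikes (Cyrillic shha, a, ie, o): hypothetical candidate set.
cross_script = ["\u04bb", "\u0430", "\u0435", "\u043e"]
# Compatibility variants (fullwidth h and a), included as a contrast case.
compat = ["\uff48", "\uff41"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, survival_rate(cross_script, form), survival_rate(compat, form))
```

On these candidates the Cyrillic look-alikes survive all four forms, while the fullwidth letters survive only the canonical forms (NFC, NFD). That suggests the choice of substitute set matters as much as the choice of normalization when delineating the threat model.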

Circularity Check

0 steps flagged

No circularity: purely descriptive claim with no derivations or fitted elements

The paper presents an exploratory idea that homoglyph substitution can degrade stylometric systems. No equations, parameters, predictions, or derivation chains appear in the provided text. The abstract and description frame the work as an investigation rather than a mathematical result derived from prior self-referential steps. None of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) apply, as there are no load-bearing logical reductions to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or required for the high-level claim.

pith-pipeline@v0.9.0 · 5532 in / 977 out tokens · 39508 ms · 2026-05-10T15:20:38.105681+00:00 · methodology
