Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution
Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3
The pith
Homoglyph substitution degrades stylometric systems by replacing characters with visually similar alternatives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performing homoglyph substitution on text degrades stylometric systems, allowing authors to reduce the leakage of personal information such as estimated age and geographic location that these systems can otherwise extract from voluntary text disclosures.
What carries the argument
Homoglyph substitution: the replacement of characters with visually similar alternatives drawn from different Unicode code points (for example, Latin 'h', U+0068, with Cyrillic 'һ', U+04BB). The substitution targets the character-level patterns that stylometric classifiers use.
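To make the mechanism concrete, here is a minimal sketch of the substitution step. The paper does not publish its mapping or algorithm, so the table `HOMOGLYPHS` and the function `substitute` are hypothetical names, and the six Latin-to-Cyrillic pairs are an illustrative assumption rather than the authors' choice.

```python
# Hypothetical, deliberately tiny table of Latin letters and Cyrillic look-alikes.
# Only the h -> U+04BB pair comes from the paper's abstract; the rest are assumptions.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def substitute(text: str, mapping: dict[str, str] = HOMOGLYPHS) -> str:
    """Replace every character that has an entry in `mapping`; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


original = "the cat"
altered = substitute(original)

# The two strings look alike on screen but differ in their code points.
print([f"U+{ord(ch):04X}" for ch in original])
print([f"U+{ord(ch):04X}" for ch in altered])
```

Substituting every eligible character is the simplest policy; a partial or randomized rate would be an equally plausible design, and the summary gives no indication of which one the paper uses.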
If this is right
- Stylometric authorship attribution and trait inference become measurably less reliable on the altered text.
- Individuals can reduce the personal information extractable from their online writing while preserving visual readability.
- Adversarial stylometry provides a practical defense against forensic analysis of voluntary text disclosures.
- Text can be altered to hinder stylometric recovery of demographic signals such as age group or location.
Where Pith is reading between the lines
- Stylometric tools may require explicit Unicode normalization steps to remain effective against this class of obfuscation.
- An iterative arms race could develop between substitution techniques and improved detection or normalization methods.
- The approach might generalize to other character-based privacy protections in digital communication.
Load-bearing premise
Stylometric systems depend on character-level or Unicode-sensitive features that homoglyph substitution will reliably disrupt without being normalized away by standard preprocessing or creating new detectable signals.
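The premise can be illustrated with character n-grams, a common stylometric feature family. The sketch below assumes a feature extractor that counts raw code-point trigrams with no normalization; `char_ngrams` and `jaccard` are hypothetical helpers, not part of any system named in the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams over raw code points."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def jaccard(a: Counter, b: Counter) -> float:
    """Share of distinct n-grams that two profiles have in common."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0


original = "the other theory"
altered = original.replace("h", "\u04bb")  # Latin h -> Cyrillic shha

before = char_ngrams(original)
after = char_ngrams(altered)

# Every trigram that contained "h" is now a different feature.
print(before["the"], after["the"])
print(jaccard(before, after))
```

A single substituted letter rewrites every n-gram that spans it, which is why the effect on such a profile can be large. Whether a given classifier is actually this sensitive is exactly what the premise leaves untested.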
What would settle it
An experiment in which stylometric accuracy on the modified text remains statistically unchanged from the original, or in which routine Unicode normalization restores full performance.
Original abstract
In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" [U+0068] → "һ" [U+04BB])--on text can degrade stylometric systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes homoglyphic substitution, that is, replacing Latin characters with visually similar glyphs from other Unicode blocks (e.g., U+0068 'h' to U+04BB 'һ'), as an adversarial stylometry technique intended to degrade the author attribution and profiling performance of stylometric systems.
Significance. If empirically validated, the approach could supply a lightweight, accessible method for textual privacy protection against stylometric inference of attributes such as age or location. It extends prior adversarial stylometry work but currently offers only a descriptive claim without demonstrated effectiveness or robustness.
Major comments (2)
- Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
- Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
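The normalization question in the second comment can be checked directly for the abstract's own example. The sketch below uses Python's standard `unicodedata` module and shows that the four standard normalization forms leave U+04BB unchanged, while a compatibility character such as the fullwidth "ｈ" does fold to ASCII. The one-entry `CONFUSABLES` table and `fold_confusables` are hypothetical stand-ins for a cross-script mapping in the spirit of the confusables data in Unicode Technical Standard #39, not a real library API.

```python
import unicodedata

cyrillic_shha = "\u04bb"  # the homoglyph of Latin h used in the abstract
fullwidth_h = "\uff48"    # FULLWIDTH LATIN SMALL LETTER H (compatibility character)

# U+04BB has no decomposition, so none of the four forms maps it to Latin h.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    folded = unicodedata.normalize(form, cyrillic_shha)
    print(form, f"U+{ord(folded):04X}", folded == "h")

# NFKC does fold compatibility variants of h to the ASCII letter.
print(unicodedata.normalize("NFKC", fullwidth_h) == "h")

# Hypothetical one-entry confusable table; a real pipeline would need a full mapping.
CONFUSABLES = {"\u04bb": "h"}


def fold_confusables(text: str, table: dict[str, str] = CONFUSABLES) -> str:
    """Apply NFKC, then map known cross-script look-alikes back to Latin."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(table.get(ch, ch) for ch in normalized)


print(fold_confusables("t\u04bbe") == "the")
```

For cross-script pairs like this one, the relevant defense is therefore an explicit confusable-folding step rather than NFKC or NFD alone, which narrows what the requested robustness analysis needs to test.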
Simulated Authors' Rebuttal
We thank the referee for their constructive comments, which highlight important gaps in empirical support and robustness analysis. We agree that the current manuscript is primarily conceptual and will incorporate experiments, implementation details, and preprocessing evaluations in the revised version to strengthen the claims.
- Referee: Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
  Authors: We acknowledge that the present version offers a descriptive proposal without quantitative validation. The manuscript introduces homoglyphic substitution as an adversarial stylometry technique but does not report experiments, datasets, or metrics. In revision, we will add empirical evaluations on standard stylometric corpora (e.g., using author attribution accuracy and attribute inference F1 scores), detail the substitution algorithm and parameters, and explicitly test for introduced signals such as non-Latin character frequency distributions to demonstrate that the method does not create easily detectable artifacts.
  Revision: yes
- Referee: Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
  Authors: This is a fair and substantive critique. The current text does not examine how homoglyph substitution interacts with common text normalization pipelines. We will revise the manuscript to include a dedicated analysis section that evaluates survival rates under NFKC/NFD normalization, script detection heuristics, and various tokenizers (e.g., word-level, subword, and Unicode-aware). Where the substitution is neutralized, we will discuss mitigation strategies or clearly delineate the threat model under which the technique remains effective.
  Revision: yes
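The "introduced signals" that both exchanges mention can be sketched as a script-detection heuristic. Python's standard library exposes no Unicode script property, so the hypothetical helper `script_of` below approximates it with the first word of the character name; that shortcut is an assumption that works for Latin and Cyrillic letters but is not a general script detector. `mixed_script_tokens` and `non_latin_ratio` are likewise illustrative names.

```python
import unicodedata
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Crude script label: first word of the Unicode name, letters only."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def mixed_script_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens whose letters span several scripts."""
    flagged = []
    for token in text.split():
        scripts = {script_of(ch) for ch in token} - {None}
        if len(scripts) > 1:
            flagged.append(token)
    return flagged


def non_latin_ratio(text: str) -> float:
    """Fraction of letters whose script label is not LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(script_of(ch) != "LATIN" for ch in letters) / len(letters)


sample = "t\u04bbe cat"  # "the" with a Cyrillic shha, then an untouched word
print(mixed_script_tokens(sample))
print(non_latin_ratio(sample))
```

A check of this kind is cheap, which is why the promised evaluation of detectable artifacts matters: substitution that defeats a classifier may still announce itself through mixed-script tokens.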
Circularity Check
No circularity: purely descriptive claim with no derivations or fitted elements
The paper presents an exploratory idea that homoglyph substitution can degrade stylometric systems. No equations, parameters, predictions, or derivation chains appear in the provided text. The abstract and description frame the work as an investigation rather than a mathematical result derived from prior self-referential steps. None of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) apply, as there are no load-bearing logical reductions to inspect.