Invisible Unicode Characters: What They Break and How to Hunt Them

Where they come from (almost never malice)

Word processors are the biggest source: autocorrect inserts non-breaking spaces and soft hyphens as you type, and both survive copy-paste into anything. PDFs shed whatever spacing the typesetter used. Web pages pad prose with &nbsp; that becomes U+00A0 on your clipboard. Messaging apps, email signatures and some AI chat interfaces emit zero-width spaces as formatting artifacts. And multilingual keyboards insert perfectly legitimate right-to-left marks that only misbehave when the text lands in a left-to-right codebase. Every one of these is invisible in nearly every editor, which is why the class of bug survives.

What they break

String equality first: "admin" === "admin" is false when one side carries a zero-width space, which surfaces as the API key that fails only when pasted, the username that "already exists", the 2FA code that is somehow wrong. Parsers next: a non-breaking space is not the space that split(" ") splits on, and a figure space inside "1 000" quietly turns a number column into text. Then the collaboration layer: two visually identical lines diff as different, grep misses what your eyes can see, and a BOM at the start of a file breaks the first line of an otherwise valid config (a failure the JSONL guide knows well). None of this is exotic; it is a routine tax paid by anyone who pastes text for a living.
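
A minimal JavaScript sketch of the equality and parsing failures described above. The sample strings and the codePoints helper are illustrative assumptions, not part of any library; the invisible characters are written as escapes so they stay visible in the source.

```javascript
// Two strings that render identically; the second carries U+200B (zero-width space).
const typed = "admin";
const pasted = "ad\u200Bmin";

console.log(typed === pasted);            // false
console.log(typed.length, pasted.length); // 5 6

// A non-breaking space (U+00A0) is not matched by a literal-space split.
const row = "alpha\u00A0beta gamma";
console.log(row.split(" ").length);       // 2
console.log(row.split(/\s+/).length);     // 3, because \s in JavaScript includes U+00A0

// A figure space (U+2007) inside a number defeats numeric parsing.
const amount = "1\u2007000";
console.log(Number(amount));              // NaN
console.log(Number.parseInt(amount, 10)); // 1 (parsing stops at the figure space)

// Hypothetical helper: list every code point so the difference becomes visible.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
console.log(codePoints(pasted).join(" "));
// U+0061 U+0064 U+200B U+006D U+0069 U+006E
```

Whether a whitespace-aware split is the right fix depends on the data; the point is only that the literal space and its lookalikes are different code points.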

The two attacks that weaponize it

Trojan Source (CVE-2021-42574) uses bidirectional control characters to make source code display in a different order than compilers read it — a line that reviews as a harmless comment can execute as logic. The defense is mechanical: treat any bidi control character in source code as a build-breaking finding, which is exactly what modern compilers, GitHub and linters started flagging after the paper. Homoglyph spoofing swaps Latin letters for Cyrillic or Greek lookalikes: a domain or username that renders as a brand you trust while being a different string entirely. Registrars and browsers block the worst cases via punycode display, but usernames, package names and email display names remain soft targets — when a link feels off, inspecting the actual code points settles it in seconds.
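
Both checks can be sketched in a few lines of JavaScript. This is an illustration under stated assumptions, not a replacement for a compiler or linter rule: findBidiLines and scriptsIn are hypothetical helper names, the flagged set is limited to the bidi embeddings, overrides and isolates, and the script check only looks at Latin, Cyrillic and Greek.

```javascript
// Bidi embeddings and overrides (U+202A-U+202E) plus isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;

// Hypothetical helper: returns the 1-based numbers of lines containing a bidi control.
function findBidiLines(source) {
  const hits = [];
  source.split("\n").forEach((line, index) => {
    if (BIDI_CONTROLS.test(line)) hits.push(index + 1);
  });
  return hits;
}

// Hypothetical helper: reports which of three scripts appear in an identifier.
// More than one entry in the result is a reason to look closer, not proof of spoofing.
function scriptsIn(identifier) {
  const found = [];
  if (/\p{Script=Latin}/u.test(identifier)) found.push("Latin");
  if (/\p{Script=Cyrillic}/u.test(identifier)) found.push("Cyrillic");
  if (/\p{Script=Greek}/u.test(identifier)) found.push("Greek");
  return found;
}

const source = "let ok = true;\nif (isAdmin) { /* \u202E check */ }";
console.log(findBidiLines(source));    // [ 2 ]

console.log(scriptsIn("paypal"));      // [ 'Latin' ]
console.log(scriptsIn("p\u0430ypal")); // [ 'Latin', 'Cyrillic' ] (U+0430 is Cyrillic а)
```

A build step could fail whenever findBidiLines returns a non-empty list. The mixed-script check is deliberately coarse, since legitimate names in some languages mix scripts.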

Normalization: the fix that is two fixes

Unicode normalization exists because the same visible text can be encoded multiple ways — "é" as one code point or as "e" plus a combining accent. NFC (the sane default) merges those so equal-looking text compares equal; every input boundary that stores or compares user text should apply it. NFKC goes further and folds compatibility characters — ligatures, fullwidth letters, superscripts — into their plain forms, which is right for identifiers (usernames, search keys) and wrong for prose, where it destroys legitimate typography. Note what neither fixes: normalization does not remove zero-width characters, bidi controls or Cyrillic lookalikes — those need explicit detection and stripping, which is a policy decision, not a normalization form.
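
The difference between the two forms, and what neither of them touches, can be seen with the built-in String.prototype.normalize. The sample strings below are illustrative.

```javascript
const composed = "\u00E9";    // é as one code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(composed === decomposed);                                   // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true

// NFKC also folds compatibility characters into their plain forms.
console.log("\uFB01le".normalize("NFKC"));     // "file" (fi ligature expanded)
console.log("\uFF21\uFF22".normalize("NFKC")); // "AB"   (fullwidth letters)
console.log("x\u00B2".normalize("NFKC"));      // "x2"   (superscript two)
console.log("x\u00B2".normalize("NFC"));       // "x²"   (NFC leaves it alone)

// Neither form removes a zero-width space or maps a Cyrillic letter to Latin.
console.log("ad\u200Bmin".normalize("NFKC").length); // 6
console.log("\u0430".normalize("NFKC") === "a");     // false
```

The superscript case shows why NFKC is wrong for prose: "x²" and "x2" mean different things to a reader, even if folding them is useful for a search key.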

A practical hunting kit

Turn on your editor’s invisible-character rendering (VS Code has highlighted invisible and ambiguous Unicode by default since the Trojan Source disclosure; check the editor.unicodeHighlight settings). In a pinch, a regex like [\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF] greps for the usual suspects: zero-width characters, directional marks, bidi embeddings, overrides and isolates, the word joiner and the BOM. At system boundaries, NFC-normalize and strip format characters on input rather than chasing them downstream. And for the interactive case (the string in front of you, right now, that will not behave), paste it into the Unicode inspector: every code point named and highlighted in place, with a cleaned copy on one button. It runs entirely locally, so the suspicious text (which is sometimes a password or a key) never leaves your machine.
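
The boundary step might look like the following JavaScript sketch. The character class covers the same families as the regex above and explicitly includes the bidi isolates U+2066-U+2069; findInvisibles and cleanInput are hypothetical helper names, and the choice of which characters to strip is an assumption that each application has to make for itself.

```javascript
// Zero-width and directional marks, bidi controls, word joiner, bidi isolates, BOM.
const INVISIBLES = /[\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF]/g;

// Hypothetical helper: lists each suspicious character with its position.
function findInvisibles(text) {
  return Array.from(text.matchAll(INVISIBLES), (m) => ({
    index: m.index,
    codePoint: "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// Hypothetical helper: NFC-normalize first, then strip the characters above.
function cleanInput(text) {
  return text.normalize("NFC").replace(INVISIBLES, "");
}

const pastedKey = "\uFEFFsk-\u200Bexample";
console.log(findInvisibles(pastedKey));
// [ { index: 0, codePoint: 'U+FEFF' }, { index: 4, codePoint: 'U+200B' } ]
console.log(cleanInput(pastedKey)); // "sk-example"
```

Stripping this whole set suits keys, usernames and config values. It is too aggressive for free text: U+200D joins emoji sequences and U+200C is meaningful in scripts such as Persian, so prose fields usually need a narrower class.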

