Confusable Detection 101: Unicode Skeletons and Mixed-Script Checks for Your Brands

NameSilo Staff

11/20/2025

Internationalized Domain Names (IDNs) brought multilingual support to the internet, allowing domains to use characters beyond basic ASCII. However, this expansion also introduced security challenges. Characters from different scripts often look identical or nearly identical, enabling attackers to register domains that appear legitimate but actually lead to phishing sites or fraudulent content.

Understanding how these confusable characters work, how registries and browsers defend against them, and what tools exist for brand protection helps organizations safeguard their online presence. The technical foundation comes from Unicode Technical Standard #39, which defines methods for detecting confusable characters and mixed-script abuse. For brand owners, this knowledge translates into actionable strategies for monitoring and preventing domain-based attacks.

Unicode Confusables and Homoglyphs Explained

Unicode encodes more than 150,000 characters covering over 160 scripts and writing systems. Many scripts evolved independently but share visually similar characters. The Latin letter "a" (U+0061) looks identical to the Cyrillic "а" (U+0430) in most fonts. Greek "ο" (omicron, U+03BF) is visually indistinguishable from Latin "o" (U+006F). These look-alike characters are called homoglyphs.

Homoglyphs create opportunities for attacks. If your brand operates at example.com, an attacker might register exаmple.com using Cyrillic "а" instead of Latin "a". To users, the domain appears identical, but browsers treat them as completely different domains. Links in phishing emails or social media posts can exploit this confusion, directing victims to malicious sites that mimic legitimate brands.

The scope of confusables extends beyond single-character substitutions. Entire words can be constructed from different scripts that look identical:

Latin "scope" vs. Cyrillic "ѕсоре"

Latin "apple" vs. all-Cyrillic "аррӏе" (the fourth character is the Cyrillic palochka, not a Latin "l")

Latin "microsoft" with various character substitutions from multiple scripts

Some of these are whole-script confusables, built entirely from a single non-Latin script. Others are mixed-script attacks that combine characters from several writing systems. In both cases the result is visually convincing but technically distinct. When you register a domain, you secure one specific sequence of Unicode code points. Confusable variations remain available for registration by others unless registries implement protections.
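To see why look-alike labels are different names, it helps to print the code points behind them. The following is a minimal sketch using only Python's standard library; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(label):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in label]

latin = "example"
spoof = "ex\u0430mple"  # the "a" here is CYRILLIC SMALL LETTER A (U+0430)

print(latin == spoof)  # False: the code point sequences differ
for row in describe(spoof):
    print(row)
```

The two strings render the same in many fonts, yet the comparison fails because the third character is a different code point.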

UTS #39: Unicode Security Mechanisms

Unicode Technical Standard #39 (UTS #39) provides the technical foundation for detecting and mitigating confusable character attacks. The standard defines several key concepts and mechanisms.

Confusable Detection: UTS #39 maintains tables of characters that are visually confusable with each other. These tables map confusable characters to representative forms, allowing systems to identify when different character sequences produce visually similar results.

Skeleton Mapping: The skeleton algorithm transforms a string into a canonical form by replacing each character with a representative character from its confusable set. Two strings that produce the same skeleton are considered confusable. For example, both "scope" (Latin) and "ѕсоре" (Cyrillic) might map to the same skeleton, indicating they are confusable.

Mixed-Script Detection: UTS #39 defines rules for identifying strings that inappropriately mix characters from different scripts. While some script mixing is legitimate (like Japanese using Kanji, Hiragana, and Katakana together), other combinations typically indicate attacks (like mixing Latin and Cyrillic in English words).

Restriction Levels: The standard defines restriction levels that specify which character combinations are acceptable:

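ASCII-Only: Only basic ASCII characters

Single Script: Characters from a single script (plus Common/Inherited)

Highly Restrictive: Single-script strings, plus a few fixed combinations needed for East Asian text (Latin with Han, Hiragana, and Katakana; Latin with Han and Bopomofo; Latin with Han and Hangul)

Moderately Restrictive: Additionally allows Latin combined with one other recommended script, except Cyrillic and Greek

Minimally Restrictive: Places no limit on script combinations but still requires characters from the allowed identifier profile

Unrestricted: Allows any Unicode characters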

Different contexts apply different restriction levels. Domain registries typically enforce policies near the restrictive end of this range, while some applications might enforce ASCII-only or single-script rules for security-critical identifiers.
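The levels can be approximated in a few lines, including the Highly Restrictive level that UTS #39 places between Single Script and Moderately Restrictive. This is a rough sketch under a significant assumption: Python's standard library does not expose the Unicode Script property, so the first word of each character's name stands in for its script. It also skips the identifier-profile and "recommended script" checks the standard requires. The function names are hypothetical.

```python
import unicodedata

# Assumption: the first word of the character name approximates the script.
SCRIPT_ALIASES = {"CJK": "HAN"}

# Script combinations accepted by the Highly Restrictive level.
HIGHLY_RESTRICTIVE_SETS = [
    {"LATIN", "HAN", "HIRAGANA", "KATAKANA"},
    {"LATIN", "HAN", "BOPOMOFO"},
    {"LATIN", "HAN", "HANGUL"},
]

def scripts_in(label):
    """Return the approximate set of scripts used by the letters in a label."""
    found = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits, hyphens, marks: treated as Common/Inherited
        first_word = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        found.add(SCRIPT_ALIASES.get(first_word, first_word))
    return found

def restriction_level(label):
    """Classify a label into an approximate UTS #39 restriction level."""
    if label.isascii():
        return "ASCII-Only"
    scripts = scripts_in(label)
    if len(scripts) <= 1:
        return "Single Script"
    if any(scripts <= allowed for allowed in HIGHLY_RESTRICTIVE_SETS):
        return "Highly Restrictive"
    others = scripts - {"LATIN"}
    if "LATIN" in scripts and len(others) == 1 and not others & {"CYRILLIC", "GREEK"}:
        return "Moderately Restrictive"
    return "Minimally Restrictive"

print(restriction_level("paypal"))            # ASCII-Only
print(restriction_level("\u0440\u0430ypal"))  # Minimally Restrictive (Latin + Cyrillic)
```

A production check would rely on real script data, for example through ICU, rather than character names.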

How Skeleton Mapping Works

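The skeleton algorithm provides a practical method for detecting confusable strings. As defined in recent versions of UTS #39, the process involves these steps:

Normalization: Convert the input string to Unicode Normalization Form NFD (canonical decomposition)

Ignorable removal: Drop default-ignorable code points, such as zero-width joiners, that are invisible to the reader

Confusable mapping: Replace each character with its prototype from the confusables data file (confusables.txt)

Re-normalization: Apply NFD normalization again to the result

Case folding is not part of the skeleton algorithm itself. Domain labels are already lowercased by IDNA processing before comparison; for other identifiers, folding case first is a common pre-processing choice.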

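The output skeleton represents the "essential form" of the string. It is meant only for comparison: skeletons are not suitable for display and should not be stored as a normalized form of the identifier. If two different input strings produce identical skeletons, they are visually confusable.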

For example:

Input: "раypal" (using Cyrillic р and а)

Skeleton: "paypal"

Input: "paypal" (all Latin)

Skeleton: "paypal"

Both inputs produce the same skeleton, identifying them as confusable even though they use different Unicode characters. Because the comparison is deterministic, automated systems can use it to detect confusable domains at scale.
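The steps can be sketched against data in the confusables.txt format (source ; target ; type # comment). This is a simplified illustration, not a complete implementation: `SAMPLE_DATA` holds only a handful of entries instead of the full file published by the Unicode Consortium, `IGNORABLE` lists only a few default-ignorable code points, and the function names are hypothetical.

```python
import unicodedata

# A few entries in confusables.txt format; the real file has thousands.
SAMPLE_DATA = """\
# source ; target ; type # comment
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0440 ; 0070 ; MA # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
"""

# A few default-ignorable code points; the full Unicode property is larger.
IGNORABLE = {"\u00ad", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def parse_confusables(text):
    """Build {source character: prototype string} from confusables.txt lines."""
    table = {}
    for raw in text.splitlines():
        line = raw.lstrip("\ufeff").split("#", 1)[0].strip()  # drop BOM and comments
        if not line:
            continue
        source, target = (field.strip() for field in line.split(";")[:2])
        table[chr(int(source, 16))] = "".join(chr(int(cp, 16)) for cp in target.split())
    return table

def skeleton(text, table):
    """NFD, remove ignorables, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    visible = (ch for ch in decomposed if ch not in IGNORABLE)
    mapped = "".join(table.get(ch, ch) for ch in visible)
    return unicodedata.normalize("NFD", mapped)

TABLE = parse_confusables(SAMPLE_DATA)
spoofed = "\u0440\u0430ypal"  # Cyrillic er and a, followed by Latin "ypal"
print(skeleton(spoofed, TABLE) == skeleton("paypal", TABLE))  # True
```

Keeping the mapping in a data file rather than in code means the table can be refreshed when the Unicode data changes without touching the comparison logic.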

Registry operators and brand protection services use skeleton mapping to identify potentially problematic domain registrations. When someone attempts to register a domain that produces a skeleton matching an existing protected trademark, the registry can block the registration or flag it for review.

Registry-Level Protections and IDN Policies

Domain name registries implement various policies to prevent confusable abuse. These policies balance enabling multilingual internet access with preventing security threats.

IDN Tables: Each registry maintains an IDN table specifying which Unicode characters are permitted for domain registrations in their TLD. These tables exclude characters known to be highly confusable or problematic. For example, many registries disallow mixing Latin and Cyrillic in the same domain label.

Variant Bundles: Some registries implement variant bundling, where confusable domains are automatically grouped together. When someone registers one variant, all confusable variants are blocked from registration by others. The original registrant can activate variants if desired, but others cannot register them separately.

Mixed-Script Restrictions: Most registries prohibit arbitrary script mixing within domain labels. A domain must consist primarily of characters from a single script, with exceptions for common characters like digits and hyphens that can appear in any script context.

Whole-Script Confusables: Beyond individual character confusables, some registries identify and block whole-script confusables—domains where an entire word in one script looks like a word in another script.

Reserved Names: Registries maintain lists of reserved domain names, often including major brands and well-known entities. These reservations prevent confusable variants from being registered in the first place.

The .com registry, for instance, uses a hybrid approach. While it supports various scripts for IDNs, it applies strict rules about script mixing and maintains confusability tables to prevent obvious homoglyph attacks. Other TLDs may have more or less restrictive policies depending on their target audiences and security priorities.
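The variant-bundle idea described above maps naturally onto a dictionary keyed by skeleton. The sketch below is hypothetical and does not reflect any particular registry's implementation: `VariantRegistry` and its `register` method are invented for illustration, and `PROTOTYPES` is a tiny subset of the real confusables data.

```python
import unicodedata

# Illustrative subset of Cyrillic-to-Latin prototypes (not the full UTS #39 data).
PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}

def skeleton(label):
    """Simplified skeleton: lowercase, NFD, map prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", label.lower())
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

class VariantRegistry:
    """Hypothetical registry that bundles labels sharing a skeleton."""

    def __init__(self):
        self._owner_by_skeleton = {}

    def register(self, label, registrant):
        """Return True if registered, False if the bundle belongs to someone else."""
        key = skeleton(label)
        owner = self._owner_by_skeleton.setdefault(key, registrant)
        return owner == registrant

registry = VariantRegistry()
print(registry.register("scope", "brand-owner"))        # True
print(registry.register("s\u0441ope", "someone-else"))  # False: bundle already owned
print(registry.register("s\u0441ope", "brand-owner"))   # True: owner activates a variant
```

Because ownership is tracked per skeleton rather than per label, the first registrant implicitly reserves every label that collapses to the same skeleton.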

Browser and Client-Side Protections

While registries provide the first line of defense, browsers and other clients implement additional protections to warn users about potentially malicious confusable domains.

Punycode Display: Internationalized domain names are stored in the DNS as ASCII-compatible strings, produced with Punycode encoding and marked by the "xn--" prefix. For example, "münchen.de" becomes "xn--mnchen-3ya.de". Browsers can choose to display domains in Unicode (münchen.de) or Punycode (xn--mnchen-3ya.de) depending on security analysis.

When a browser detects suspicious script mixing or confusable characters, it displays the Punycode representation instead of the Unicode form. This makes the domain look obviously different from the intended target, alerting users to potential phishing. If you visit a site using mixed scripts that triggers browser warnings, you will see "xn--..." in the address bar rather than the Unicode characters.
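The conversion and the display decision can be sketched with Python's built-in "idna" codec, which implements the older IDNA 2003 rules (the third-party idna package covers IDNA 2008). The `display_form` policy here is a deliberately crude assumption: it falls back to Punycode whenever a label mixes scripts, using character names as a stand-in for script data. Real browser logic is considerably more elaborate, and this version would not catch whole-script confusables.

```python
import unicodedata

def to_ascii(domain):
    """Encode a Unicode domain into its ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")

def to_unicode(domain):
    """Decode an ASCII-compatible domain back into Unicode."""
    return domain.encode("ascii").decode("idna")

def scripts_in(label):
    """Approximate the scripts in a label from the first word of each letter's name."""
    return {unicodedata.name(ch, "UNKNOWN").split(" ")[0] for ch in label if ch.isalpha()}

def display_form(domain):
    """Show Unicode only when every label sticks to a single script."""
    ascii_form = to_ascii(domain)
    unicode_form = to_unicode(ascii_form)
    if all(len(scripts_in(label)) <= 1 for label in unicode_form.split(".")):
        return unicode_form
    return ascii_form

print(to_ascii("münchen.de"))            # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))   # münchen.de
print(display_form("\u0440\u0430ypal.com"))  # Punycode, because the label mixes scripts
```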

Homograph Attack Detection: Browsers maintain their own confusable character databases and apply detection algorithms. Different browsers use varying approaches:

Chrome: Uses ICU (International Components for Unicode) libraries implementing UTS #39 recommendations

Firefox: Applies mixed-script detection and shows Punycode for suspicious combinations

Safari: Uses similar mixed-script policies with additional platform-specific heuristics

Top-Level Domain Considerations: Browsers consider the TLD when evaluating IDN safety. A Cyrillic domain under a Russian TLD (.ru) is more likely to be legitimate than a Cyrillic domain under .com. Browsers may display Unicode forms for IDNs in country-code TLDs associated with the script but show Punycode for the same characters in generic TLDs.

These client-side protections significantly reduce the effectiveness of confusable domains. However, attackers continue attempting these registrations, hoping to catch users with outdated browsers or in contexts where URLs are not carefully examined (like email links or QR codes).

Brand Monitoring and Detection Strategies

Organizations protecting their brands must actively monitor for confusable domain registrations. Several strategies and tools help with this:

Automated Scanning: Services that monitor new domain registrations can apply skeleton mapping algorithms to identify newly registered domains that are confusable with your brand. These services alert you when suspicious registrations occur, allowing rapid response.

Comprehensive Variant Generation: Generate all possible confusable variants of your brand name using UTS #39 confusable tables. This produces a list of domains that could be used in attacks. Proactively registering high-risk variants prevents attackers from using them, though this can be expensive for brands with many potential variants.

Certificate Transparency Monitoring: Certificate Transparency (CT) logs record publicly trusted TLS certificates as they are issued. Monitoring these logs for certificates issued to confusable variants of your domain reveals when attackers set up HTTPS sites attempting to appear legitimate. CT log monitoring tools can apply skeleton mapping to identify confusable certificates.

WHOIS Monitoring: Regular WHOIS queries for confusable variants help identify registrations early. While WHOIS privacy protects legitimate registrant information, the registration event itself is observable. Automated systems can poll for variants and alert when previously unregistered confusables become registered.

Search Engine Monitoring: Search for your brand name and common variants to identify sites attempting to impersonate you. Attackers often try to rank confusable domains for your brand terms, making them discoverable through search results.

Social Media Scanning: Monitor social media platforms for links to confusable domains. Phishing campaigns often use social media to distribute malicious links, and early detection helps platforms remove the content before widespread distribution.
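Automated scanning of a registration feed or of hostnames taken from CT log entries comes down to the same loop: decode each name, compute skeletons, and compare against your brands. The sketch below assumes the feed is already available as a list of ASCII hostnames; how you obtain it is outside its scope. `flag_confusables` and the small `PROTOTYPES` table are illustrative, not a complete confusables data set.

```python
import unicodedata

# Illustrative subset of prototypes (Cyrillic and Greek to Latin).
PROTOTYPES = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0455": "s", "\u03bf": "o",
}

def skeleton(text):
    """Simplified skeleton: NFD, map prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def flag_confusables(hostnames, brands):
    """Return (hostname, label, brand) for labels that imitate a brand."""
    brand_by_skeleton = {skeleton(brand): brand for brand in brands}
    alerts = []
    for host in hostnames:
        try:
            decoded = host.encode("ascii").decode("idna")  # expand xn-- labels
        except UnicodeError:
            continue  # malformed name; skipped in this sketch
        for label in decoded.split("."):
            brand = brand_by_skeleton.get(skeleton(label))
            if brand is not None and label != brand:
                alerts.append((host, label, brand))
    return alerts

# Build a sample feed entry the way it would appear in DNS or a certificate.
spoof = "\u0440\u0430ypal.com".encode("idna").decode("ascii")
print(flag_confusables([spoof, "paypal.com", "example.org"], {"paypal"}))
```

Checking every label, not just the registrable one, also surfaces look-alikes placed in subdomains.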

Defensive Registration Strategies

Proactively registering confusable variants provides strong protection but must be balanced against cost and practicality. Not every possible variant warrants registration.

High-Value Variants: Focus on the most visually convincing confusables. If your brand is "scope", prioritize registrations like "sсope" (Cyrillic с) that look identical in most contexts. Lower-priority variants might include more obscure character substitutions less likely to fool users.

TLD Prioritization: Register high-risk variants in major TLDs (.com, .net, .org) before considering smaller TLDs. Attackers target widely recognized TLDs for maximum impact.

Reactive Registration: When you discover attackers have registered confusable variants, consider registering similar patterns they might try next. Attackers often test multiple variants, and registering related confusables prevents their expansion.

Legal Protections: For brands with registered trademarks, legal mechanisms like UDRP (Uniform Domain-Name Dispute-Resolution Policy) provide recourse against confusable domains used in bad faith. However, legal processes take time, making proactive registration or rapid detection preferable for critical brands.

For organizations managing substantial domain portfolios, defensive registrations add to operational complexity. Each registered domain requires renewal, DNS configuration, and potentially hosting or redirects to your main site. Automated portfolio management tools help handle these operational requirements at scale.
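To decide which variants are worth registering, it helps to enumerate them first. The sketch below generates variants by substituting look-alike characters and records how many substitutions each one uses. `HOMOGLYPHS` is a small hand-picked table for illustration, and the cap on substitutions is an assumption to keep the list manageable, not a rule from UTS #39.

```python
from itertools import product

# Hand-picked look-alikes for illustration; the UTS #39 data is far larger.
HOMOGLYPHS = {
    "a": ["\u0430"],            # Cyrillic a
    "c": ["\u0441"],            # Cyrillic es
    "e": ["\u0435"],            # Cyrillic ie
    "o": ["\u043e", "\u03bf"],  # Cyrillic o, Greek omicron
    "p": ["\u0440"],            # Cyrillic er
    "s": ["\u0455"],            # Cyrillic dze
}

def generate_variants(brand, max_substitutions=2):
    """Return (variant, substitution_count) pairs, fewest substitutions first."""
    choices = [[ch] + HOMOGLYPHS.get(ch, []) for ch in brand]
    variants = []
    for combo in product(*choices):
        changed = sum(1 for original, new in zip(brand, combo) if original != new)
        if 1 <= changed <= max_substitutions:
            variants.append(("".join(combo), changed))
    return sorted(variants, key=lambda pair: pair[1])

for variant, count in generate_variants("scope", max_substitutions=1):
    # Print the ASCII-compatible form that would actually be registered.
    print(count, variant, variant.encode("idna").decode("ascii"))
```

The number of combinations grows multiplicatively with the length of the name, so longer brands need a tighter cap or a smarter enumeration than this brute-force product.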

Detection Tooling and Implementation

Several categories of tools help implement confusable detection:

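Open-Source Libraries: Libraries implementing UTS #39 algorithms exist for most programming languages. ICU (International Components for Unicode) provides comprehensive Unicode support, including confusable detection through its spoof checker API. Language-specific options include:

Python: confusable_homoglyphs for confusable and mixed-script checks; PyICU for access to ICU's spoof checker. (unidecode is sometimes listed here, but it transliterates text to ASCII rather than detecting confusables.)

JavaScript: confusables, unicode-tr39

Java: ICU4J (SpoofChecker)

Go: golang.org/x/text/unicode/norm supplies the normalization step only; the confusable mapping has to come from the UTS #39 data files or a third-party package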

These libraries allow you to build custom detection systems that check domain registrations, email addresses, usernames, or other identifiers against your brand names.

Commercial Services: Brand protection companies offer monitoring services that continuously scan for confusable registrations. These services often include:

Real-time alerts for new registrations

Risk scoring based on similarity and usage

Takedown assistance and legal support

Historical data about attack patterns

Registry Services: Some registries offer brand protection services directly, allowing trademark holders to register protections that automatically block confusable registrations. This is more effective than defensive registration because it prevents the domains from being registered at all rather than requiring you to register them yourself.

Browser Extensions: Security-focused browser extensions can highlight confusable domains in links before users click them. These extensions apply real-time confusable detection to URLs displayed on pages or in link previews.

Implementing Your Own Detection System

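Organizations wanting to build internal confusable detection can start with skeleton-based matching. The following sketch uses only the Python standard library and a small illustrative mapping table; a production system would load the full UTS #39 confusables data instead:

    import unicodedata

    # Illustrative subset of the UTS #39 confusables data (Cyrillic to Latin).
    # Load the full confusables.txt for real-world coverage.
    CONFUSABLES = {
        "\u0430": "a",  # Cyrillic а
        "\u0435": "e",  # Cyrillic е
        "\u043e": "o",  # Cyrillic о
        "\u0440": "p",  # Cyrillic р
        "\u0441": "c",  # Cyrillic с
        "\u0445": "x",  # Cyrillic х
    }

    def get_skeleton(text):
        # NFD-normalize, map each character to its prototype, then NFD again
        decomposed = unicodedata.normalize("NFD", text)
        mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
        return unicodedata.normalize("NFD", mapped)

    def is_confusable(text1, text2):
        # Two strings are confusable if their skeletons are identical
        return get_skeleton(text1) == get_skeleton(text2)

    # Example usage
    brand = "example"
    suspicious = "\u0435\u0445\u0430mple"  # Cyrillic е, х, and а; the rest is Latin

    if is_confusable(brand, suspicious):
        print(f"Warning: '{suspicious}' is confusable with '{brand}'")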

This basic approach can be extended to:

Monitor new domain registration feeds

Scan Certificate Transparency logs

Validate user-submitted identifiers

Check domains before sharing in communications

More sophisticated systems incorporate:

Context-aware analysis (considering TLD, usage patterns)

Machine learning to identify intentional attacks vs. coincidental similarity

Integration with threat intelligence feeds

Automated response workflows (alerts, WHOIS queries, legal notifications)

Special Considerations for Email Security

Email represents a major attack vector for confusable domain abuse. Phishing emails from confusable domains appear legitimate, bypassing user scrutiny. Several email-specific considerations apply:

SPF, DKIM, and DMARC: Email authentication mechanisms help but do not completely solve confusable domain problems. A perfectly configured SPF/DKIM/DMARC setup on a confusable domain will still pass authentication checks because the domain is technically different from your legitimate domain.

Display Name Spoofing: Even without confusable domains, attackers can set email display names to show legitimate-looking addresses. Combined with confusable domains, this creates convincing phishing emails. Email security solutions should validate both display names and actual addresses.

Link Analysis: Email security systems should analyze links in messages for confusable domains. Even if the sender address is legitimate, embedded links to confusable domains indicate potential compromise or phishing.

User Training: Educating users to carefully check domain names in email addresses and links remains critical. However, confusable characters make visual inspection unreliable, emphasizing the need for technical controls.
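Link analysis of the kind described above can be prototyped with the standard library. The sketch below pulls URLs out of a message body and flags any whose hostname contains a label that imitates the brand without being the brand itself. The regular expression, the `suspicious_links` helper, and the small `PROTOTYPES` table are all simplifying assumptions; real mail filters parse MIME and HTML properly and use the full confusables data.

```python
import re
import unicodedata
from urllib.parse import urlsplit

# Illustrative subset of Cyrillic-to-Latin prototypes.
PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def skeleton(text):
    """Simplified skeleton: NFD, map prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def suspicious_links(body, brand):
    """Return URLs whose hostname has a label confusable with the brand."""
    target = skeleton(brand)
    hits = []
    for url in URL_PATTERN.findall(body):
        host = urlsplit(url).hostname or ""
        try:
            host = host.encode("ascii").decode("idna")  # expand xn-- labels
        except UnicodeError:
            pass  # already Unicode or malformed; inspect as written
        if any(label != brand and skeleton(label) == target for label in host.split(".")):
            hits.append(url)
    return hits

message = (
    "Your account is locked. Sign in at https://\u0440\u0430ypal.attacker.example/login "
    "or read https://paypal.com/help for details."
)
print(suspicious_links(message, "paypal"))
```

Only the first link is reported: the second uses the genuine label, so its skeleton matches but the label is identical to the brand.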

Emerging Threats and Future Developments

As defenses improve, attackers adapt their techniques. Several emerging patterns deserve attention:

Subdomain Confusables: Rather than registering confusable second-level domains, attackers might place confusable labels in subdomains of domains they already control. "paypal.attacker.com" is obvious, but "раypal.attacker.com" (with Cyrillic р) might be overlooked in casual inspection.

QR Code Attacks: QR codes hide the actual URL from users. A QR code pointing to a confusable domain gives no visual indication of the deception. As QR code usage increases, this attack vector grows.

Voice Interfaces: Voice assistants and dictation systems introduce ambiguity when users speak domain names. "example.com" and confusable variants sound identical, potentially leading users to malicious sites.

AI-Generated Domains: Machine learning systems can generate optimized confusable domains that evade specific detection algorithms while remaining visually convincing. This arms race between detection and evasion continues evolving.

The Unicode Consortium revises UTS #39 and its confusables data alongside new Unicode releases to cover newly added characters and newly identified attack patterns. Refreshing the data files your detection system relies on keeps it effective against evolving threats.

Legal and Policy Frameworks

Beyond technical measures, legal and policy frameworks provide recourse against confusable domain abuse:

UDRP: The Uniform Domain-Name Dispute-Resolution Policy allows trademark holders to challenge domain registrations made in bad faith. Confusable domains used for phishing or brand impersonation often meet UDRP criteria for transfer or cancellation.

Trademark Rights: Registered trademarks provide legal standing to challenge confusable domains. The more distinctive and well-known your mark, the stronger your case against confusable registrations.

Registry Policies: Registries may have additional policies for handling confusable domain complaints. Some allow expedited takedowns for clearly fraudulent confusables of major brands.

Law Enforcement: Domains used in criminal phishing or fraud schemes fall under law enforcement jurisdiction. Evidence of confusable domain use in crimes can lead to domain seizures and criminal prosecution.

Safe Harbor Provisions: Registrars and registries often have safe harbor protections when acting on good-faith takedown requests. This encourages cooperation in removing problematic confusable domains.

Practical Action Plan for Brand Protection

Organizations should implement a systematic approach to confusable domain protection:

Assessment Phase:

Inventory all brand names, product names, and key identifiers

Generate confusable variants using skeleton mapping

Assess risk levels for each variant based on visual similarity and attack likelihood

Check which high-risk variants are already registered

Protection Phase:

Implement monitoring for newly registered confusables

Set up alerts for certificate transparency log entries

Configure email security to flag confusable sender domains

Response Phase:

Establish procedures for investigating suspicious confusable registrations

Prepare UDRP and legal templates for challenging problematic domains

Create communication plans for warning customers about active phishing campaigns

Document takedown processes for various registrars and hosting providers

Continuous Improvement:

Review effectiveness of monitoring systems quarterly

Update confusable variant lists as new attack patterns emerge

Train security teams on latest confusable attack techniques

Participate in industry information sharing about brand impersonation threats

Balancing Security with Legitimate Use

While aggressive confusable detection protects brands, overly broad restrictions can impact legitimate users. A Korean company named "아마존" should be able to register that domain even though a transliteration-based brand filter might relate it to "amazon" (skeleton mapping would not, since the two strings do not look alike). Registry policies and detection systems must distinguish between coincidental similarity and intentional impersonation.

Context matters significantly. A confusable domain registered years before your brand became well-known might be completely legitimate. Geographic and linguistic context helps—confusables in script-appropriate TLDs serving their native communities differ from confusables in inappropriate TLDs targeting other markets.

Human review of flagged domains prevents false positives from disrupting legitimate businesses. Automated detection identifies suspicious registrations, but investigation should consider:

Registration date relative to brand prominence

Website content and intent

TLD appropriateness for the script used

Registrant identity and location

Historical use patterns

This balanced approach protects brands while respecting legitimate internationalized domain usage.

Conclusion

Unicode confusables represent a persistent challenge in internet security. The technical mechanisms defined in UTS #39 provide strong foundations for detection, but brand protection requires vigilance, appropriate tooling, and systematic monitoring. Understanding skeleton mapping, mixed-script policies, and registry protections enables organizations to defend against these sophisticated attacks while supporting legitimate multilingual internet use.

As domain internationalization expands and attackers refine their techniques, staying informed about Unicode security developments remains essential. Organizations that invest in confusable detection and brand monitoring significantly reduce their exposure to phishing, fraud, and brand impersonation through look-alike domains.
