A registrant requests an IDN from a registrar that supports IDNs. The registrar converts the local-language characters into a sequence of supported letters using an ASCII-compatible encoding (ACE). The registrar submits the ACE string to the Verisign® Shared Registration System (SRS) where it is validated. The IDN is added to the .com and .net TLD zone files and propagated across the Internet.
When a user enters an IDN in a native script into a Web browser or follows a link, IDN-enabled applications encode the characters into an ACE string that the DNS understands. The DNS processes the request and returns the information to the application. Although the process sounds simple, enabling applications and the DNS to support different languages and scripts has required significant research and development.
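The application-side conversion described above can be sketched in Python. This is a minimal illustration, not the behavior of any particular browser: it relies on Python's built-in "idna" codec, which implements the older IDNA2003 rules, and the helper names to_ace and to_unicode are hypothetical.

```python
def to_ace(hostname: str) -> str:
    """Convert a Unicode hostname to the ASCII-compatible form sent to the DNS."""
    # Python's built-in "idna" codec follows IDNA2003 (RFC 3490), not IDNA2008.
    return hostname.encode("idna").decode("ascii")


def to_unicode(ace_hostname: str) -> str:
    """Convert an ACE hostname back to Unicode for display to the user."""
    return ace_hostname.encode("ascii").decode("idna")


print(to_ace("bücher.com"))             # xn--bcher-kva.com
print(to_unicode("xn--bcher-kva.com"))  # bücher.com
print(to_ace("example.com"))            # example.com (ASCII labels pass through)
```

Only the labels that contain non-ASCII characters are rewritten; the DNS itself only ever sees the ASCII form.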
The Internet Engineering Task Force (IETF) led the effort to create standards for using non-ASCII characters in the Domain Name System (DNS).
The DNS recognizes only the ASCII letters A-Z (case-insensitive), the digits 0-9 and the hyphen '-'. This limits the number of characters that can be used to build domain names to 37 of the more than 96,000 characters identified within Unicode. To create domain names from the range of Unicode characters, a character-encoding scheme that uniquely maps Unicode code points to an ASCII representation must be used and standardized.
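The 37-character limit can be checked with a short Python sketch. The names LDH_LOWER, ALLOWED and is_ldh_label are hypothetical, and the check covers only the character repertoire, not other DNS label rules such as length.

```python
import string

# The 37 characters counted in the text: 26 letters, 10 digits and the hyphen.
LDH_LOWER = string.ascii_lowercase + string.digits + "-"
print(len(LDH_LOWER))  # 37

# The DNS treats letters case-insensitively, so both cases are accepted here.
ALLOWED = set(string.ascii_letters + string.digits + "-")


def is_ldh_label(label: str) -> bool:
    """Return True if every character is an ASCII letter, digit or hyphen."""
    return bool(label) and all(ch in ALLOWED for ch in label)


print(is_ldh_label("example"))        # True
print(is_ldh_label("bücher"))         # False: "ü" is outside the repertoire
print(is_ldh_label("xn--bcher-kva"))  # True: the ACE form uses only LDH characters
```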
The IETF published these standards related to Internationalized Domain Names (IDN): Encoding Schemes, Framework, Protocol, Unicode and Right-to-Left Scripts.
The encoding scheme for IDNs uses Punycode, an ASCII Compatible Encoding (ACE) that encodes local language characters into ASCII characters such that DNS can accurately answer a request for an address record. To select Punycode as the ACE standard, IETF considered the balance between compression and implementation. Punycode allows the greatest number of characters (code points) to be represented and is not difficult to deploy.
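To make the encoding concrete, the sketch below applies Python's built-in "punycode" codec to a single label and adds the "xn--" ACE prefix. It deliberately skips the character preparation and validation steps that the IDNA standards require before encoding, and the helper names label_to_ace and ace_to_label are hypothetical.

```python
ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    """Encode one label with Punycode and add the ACE prefix if needed."""
    if label.isascii():
        return label  # nothing to encode
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def ace_to_label(ace: str) -> str:
    """Reverse label_to_ace for labels that carry the ACE prefix."""
    if ace.lower().startswith(ACE_PREFIX):
        return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return ace


# ASCII characters are copied as-is; non-ASCII ones are encoded after the last hyphen.
print("bücher".encode("punycode"))    # b'bcher-kva'
print(label_to_ace("bücher"))         # xn--bcher-kva
print(ace_to_label("xn--bcher-kva"))  # bücher
```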
This RFC is one of a collection that, together, describe the protocol and usage context for a revision of Internationalized Domain Names for Applications (IDNA) that was largely completed in 2008, known within the series and elsewhere as "IDNA2008." The series replaces an earlier version of IDNA [RFC 3490] [RFC 3491]. For convenience, that version of IDNA is referred to as "IDNA2003." The newer version continues to use the Punycode algorithm [RFC 3492] and the ACE (ASCII-Compatible Encoding) prefix from the earlier version.
This RFC describes the core IDNA2008 protocol and its operations. In combination with the "bi-directional" (Bidi) document described below, it explicitly updates and replaces [RFC 3490].
This RFC specifies rules for deciding whether a code point, considered in isolation or in context, is a candidate for inclusion in an IDN. It is part of the specification of IDNA2008.
The use of right-to-left scripts in Internationalized Domain Names (IDNs) has presented several challenges. This RFC provides new Bidi rules for Internationalized Domain Names for Applications (IDNA) labels, based on the problems encountered with some scripts and some shortcomings in the 2003 IDNA Bidi criterion.
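As a rough illustration of what a Bidi check involves, the sketch below classifies a label by the Unicode bidirectional class of its first character, using Python's unicodedata module. It covers only the first condition of the Bidi rule in RFC 5893 (the first character must have Bidi class L, R or AL) and is not a complete implementation; the function name label_direction is hypothetical.

```python
import unicodedata


def label_direction(label: str) -> str:
    """Classify a label by the Bidi class of its first character."""
    first = unicodedata.bidirectional(label[0])
    if first in ("R", "AL"):   # Hebrew-type or Arabic-type letter
        return "RTL"
    if first == "L":           # left-to-right letter
        return "LTR"
    return "invalid"           # e.g. a leading digit (class EN)


print(label_direction("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew label: RTL
print(label_direction("\u0645\u062b\u0627\u0644"))  # Arabic label: RTL
print(label_direction("example"))                   # LTR
print(label_direction("1abc"))                      # invalid
```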
This RFC provides the background, explanation and rationale for the new RFCs, which address issues that have arisen from the previous versions of IDNA. It also discusses the need to update the version of Unicode supported in IDNs.
These standards have been published and are now available:
Verisign is committed to following the IETF standards and supporting rapid deployment of this new technology.
Internationalized Domain Names (IDNs) are second- or third-level domain names or Web addresses registered in any character set or script defined in Unicode.
To understand how Verisign IDNs support domain name registration in hundreds of native languages with a single Shared Registration System (SRS), it helps to understand how characters and scripts are used in written language and translated for computing.
A script is a collection of symbols used to represent textual information in a language. Examples of scripts: Latin, Arabic, Han, Greek.
A character is the basic building block of any script, and thus of any written language. It conveys meaning at a fundamental level; you cannot break a character down any further and still have meaning.
A written language utilizes characters from one or more scripts to communicate meaning. Examples of languages: English, Farsi, Chinese, Greek.
Different scripts use different keyboards or soft keyboards for input into computing devices. Computer operating systems have Input Method Editors (IMEs) that facilitate the input of different scripts. IDNs are a similar type of adaptation, allowing people to use their local-language script to navigate the Web, send and receive email, transfer files and use other applications that require domain names.
A computer uses character encoding to understand characters. Each character within a character set is assigned a unique number. For example, in the ASCII-coded character set, the uppercase "A" is assigned the number 65. Most domain names are registered in ASCII characters (A to Z, 0 to 9 and the hyphen “-”). However, words in languages that require diacritics, such as Spanish and French, and languages that use non-Latin scripts, such as Kanji and Arabic, cannot be rendered in ASCII. Unicode is a universal coded character set, which covers as many as 350 different native languages. For this reason, IDNs use Unicode.
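The character-to-number mapping is easy to inspect in Python, where ord returns a character's code point. The helper name code_point is hypothetical; the sample characters are chosen only for illustration.

```python
def code_point(ch: str) -> str:
    """Format a character's Unicode code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(ord("A"))  # 65, the ASCII value cited in the text

# One ASCII letter, one letter with a diacritic, one Arabic and one Han character.
for ch in ["A", "\u00e9", "\u0639", "\u6f22"]:
    print(ch, code_point(ch), "ASCII" if ch.isascii() else "not ASCII")
```

Characters that report "not ASCII" here are the ones that must be encoded before they can appear in a DNS query.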
The Verisign IDN infrastructure complies with ICANN Registry Implementation Committee (RIC) guidelines and requires that each IDN be associated with a specific language using a “language tag.” The registrant selects the IDN language tag during the registration process. If an IDN combines more than one language, the registrant must select the most appropriate language. Not all language tags are referenced today; however, capturing the information during the registration process allows the adoption of language tables in the future.
When an IDN registration is requested, the language tag is checked against a list of languages that have character inclusion tables or character-variant mapping tables. These tables are applied to the Unicode points that make up a registration to determine whether the registration is valid for a specific language. If a registration fails for one language, the character set may still be available with a different language tag.
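The table lookup described above might look like the following sketch. Everything in it is an assumption made for illustration: the tag names, the table contents and the function is_valid_for_tag are invented and are not Verisign's actual language tags, tables or SRS logic. The sketch also assumes that a tag without a table imposes no extra character check.

```python
import string

BASE = set(string.ascii_lowercase + string.digits + "-")

# Hypothetical inclusion tables: language tag -> characters allowed for that tag.
INCLUSION_TABLES = {
    "demo-german": BASE | set("äöüß"),
    "demo-greek": BASE | set("αβγδεζηθικλμνξοπρστυφχψω"),
}


def is_valid_for_tag(label: str, language_tag: str) -> bool:
    """Check every character of the label against the table for its tag."""
    table = INCLUSION_TABLES.get(language_tag)
    if table is None:
        # Assumption: tags that have no table are not checked in this sketch.
        return True
    return all(ch in table for ch in label)


print(is_valid_for_tag("bücher", "demo-german"))  # True
print(is_valid_for_tag("bücher", "demo-greek"))   # False: "ü" is not in that table
```

The second call shows the situation the text describes: the same characters can fail under one language tag and pass under another.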
Verisign has worked to address the issue of character variants with interested stakeholders. Registrants typically register domain names that have meaning in their own language such as a name, word or phrase. However, a single script may be used by more than one language.
As a result, a domain name may have different meanings in the context of other languages or cultures. The variant phenomenon has been classified into four different categories: character, orthographic, lexemic and contextual variants. Verisign has determined that addressing character variants is essential to enable users to navigate the Internet in their own languages. The other variants require difficult linguistic judgments that are not essential to delivering a robust IDN solution.
Many languages may have character variants that could potentially cause end-user confusion. For example, the Chinese language has two written forms: Simplified Chinese, used primarily in Mainland China, and Traditional Chinese, used primarily in Taiwan, Hong Kong and other Southeast Asian countries. The two written forms share many characters; however, simplified characters in Simplified Chinese may have the same meaning as complex characters in Traditional Chinese. These characters, called character variants, have the same meaning and pronunciation, but they do not look the same.
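A character-variant table can be pictured as a mapping from each character to the set of characters treated as equivalent to it. The sketch below uses a single, well-known Simplified/Traditional pair (U+56FD and U+570B, both meaning "country") and enumerates the variant spellings of a label. The table and the function variant_labels are hypothetical and do not reproduce any registry's actual variant tables or policy.

```python
from itertools import product

# Hypothetical variant table: each character maps to its set of equivalents.
VARIANTS = {
    "\u56fd": {"\u56fd", "\u570b"},  # Simplified form of "country"
    "\u570b": {"\u56fd", "\u570b"},  # Traditional form of "country"
}


def variant_labels(label: str) -> set:
    """Return every other spelling of the label under the variant table."""
    choices = [sorted(VARIANTS.get(ch, {ch})) for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}


# U+4E2D U+56FD ("China", Simplified) has one variant, U+4E2D U+570B (Traditional).
print(variant_labels("\u4e2d\u56fd"))
print(variant_labels("example"))  # set(): no variants in this table
```

The number of variant labels grows multiplicatively with the number of characters that have variants, which is one reason variant handling is usually table-driven.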
Thought leaders in the technical community have suggested different approaches to address the character variant issue. Each approach has both positive and negative aspects. However, the IDN community agrees that the character variant issue may never be fully resolved because languages are always changing, and new character variants will continue to emerge. Verisign has adopted language tags that reference language tables to address the character variant issue.
Verisign has worked to address the issue of character variants with interested stakeholders, including China Network Information Center (CNNIC) (.cn), Taiwan Network Information Center (TWNIC) (.tw), National Internet Development Agency of Korea (.kr), Japan Registry Service (JPRS) (.jp), the Chinese Domain Name Consortium (CDNC) and the IDN Implementation Committee established by ICANN.
Verisign has developed a policy for IDN registrations specifying permissible and prohibited code points.
The Verisign Shared Registration System (SRS) allows the creation of Internationalized Domain Names (IDNs) that contain Unicode supported non-ASCII scripts.
Understand the five validation rules through which the policy is implemented.
After validating an IDN, Verisign executes some further logic based on the language tag of the registration.