Insert a Unicode Script

A Unicode script is one of the many Unicode properties that you can insert via the Insert Token button on the Create panel.

The Unicode standard places each assigned code point (character) into exactly one script. A script is a group of code points used by a particular human writing system. Insert a regular expression token to match a Unicode script if you want to match any character from a particular writing system. Note that a writing system is not the same as a language. Some writing systems, such as Latin, span multiple languages, while some languages, such as Japanese, use multiple scripts.
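As a rough illustration of what such a token matches, the sketch below assumes a regex flavor that uses the \p{Script=Name} syntax, as JavaScript does with the "u" flag. The helper name countInScript and the sample text are made up for this example.

```typescript
// Count how many characters of `text` belong to the given Unicode script.
// The pattern is built from a string so the script name can be a parameter.
function countInScript(text: string, script: string): number {
  const token = new RegExp("\\p{Script=" + script + "}", "gu");
  const matches = text.match(token);
  return matches === null ? 0 : matches.length;
}

// Japanese sample text that mixes three scripts.
const japaneseSample = "日本語のテキスト";

console.log(countInScript(japaneseSample, "Han"));      // 3 (日 本 語)
console.log(countInScript(japaneseSample, "Hiragana")); // 1 (の)
console.log(countInScript(japaneseSample, "Katakana")); // 4 (テ キ ス ト)
console.log(countInScript(japaneseSample, "Latin"));    // 0
```

A single language can therefore require several script tokens, which is why the token is chosen per writing system rather than per language.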

ISO 15924 is a standard that assigns 3-digit numbers and 4-letter codes to all human writing systems. Unicode has adopted these 4-letter codes as aliases for its script names. If your application supports these codes, you can tick “short script names” in the Unicode Script dialog box to insert the 4-letter code instead of Unicode’s full name for the script.
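The following sketch shows a full script name and its 4-letter alias side by side. It assumes JavaScript's syntax, where both \p{Script=Cyrillic} and the shorter \p{sc=Cyrl} are accepted. Other applications may spell these tokens differently or may not accept the short codes at all.

```typescript
// Full Unicode script name versus its ISO 15924 four-letter alias.
const cyrillicFullName = new RegExp("^\\p{Script=Cyrillic}+$", "u");
const cyrillicShortName = new RegExp("^\\p{sc=Cyrl}+$", "u");

const cyrillicWord = "Москва";
const latinWord = "Moscow";

// Both spellings select the same set of characters.
console.log(cyrillicFullName.test(cyrillicWord), cyrillicShortName.test(cyrillicWord)); // true true
console.log(cyrillicFullName.test(latinWord), cyrillicShortName.test(latinWord));       // false false
```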

Forcing each character into a single script (with two generic scripts “Common” and “Inherited” for characters to be shared between scripts) has proved to be restrictive for related scripts that share characters. Unicode solved this by introducing script extensions. Script extensions use the same script names as the base Unicode Script property, but they can assign several of those script names to a single character.

Some applications provide separate syntax for matching the Script property or the Script_Extensions property. For those applications you can select “base script only” to generate a Script property, or “include script extensions” to generate a Script_Extensions property. Other applications only have one syntax for matching a Unicode script, which could be the Script property or the Script_Extensions property, depending on the application. For those applications, the dialog box automatically selects the available option and disables the other.
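JavaScript is one example of a flavor with separate syntax for the two properties, so it can serve as a small sketch of the difference. The character used here is U+30FC, the prolonged sound mark shared by Hiragana and Katakana text. Its Script value is Common, while its Script_Extensions value lists both Hiragana and Katakana. The variable names are only illustrative.

```typescript
const prolongedSoundMark = "\u30FC"; // ー
const coffee = "コーヒー";            // Katakana word containing U+30FC twice

// Base Script property: U+30FC is Common, not Katakana.
const baseCommon = new RegExp("^\\p{Script=Common}$", "u");
const baseKatakanaWord = new RegExp("^\\p{Script=Katakana}+$", "u");

// Script_Extensions property: U+30FC counts as both Hiragana and Katakana.
const extHiragana = new RegExp("^\\p{Script_Extensions=Hiragana}$", "u");
const extKatakanaWord = new RegExp("^\\p{Script_Extensions=Katakana}+$", "u");

console.log(baseCommon.test(prolongedSoundMark));  // true
console.log(extHiragana.test(prolongedSoundMark)); // true
console.log(baseKatakanaWord.test(coffee));        // false: ー is not Script=Katakana
console.log(extKatakanaWord.test(coffee));         // true
```

For matching whole words in a writing system, the Script_Extensions form is usually the more forgiving choice, since shared punctuation and marks no longer break the match.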

The list of available scripts depends on the version of Unicode that your application supports. You can select any one of those scripts from the list. The list is always ordered alphabetically by the full names of the scripts. The exception is the “Unknown” script which is listed first if your application supports it. The “Unknown” script is used exclusively for unassigned code points.

If you don’t know the name of the script but you have a character that belongs to it, you can enter that character, or the hexadecimal representation of its code point, into the “select this character’s script” box. This immediately selects that character’s script in the list. The lookup always selects the character’s value for the Script property, even if you selected the option for script extensions. The preview that shows the characters included in the selected script does take into account whether you want to use the Script or Script_Extensions property.
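A lookup of this kind can be approximated in code. The sketch below is not how the dialog box is implemented. It assumes JavaScript's \p{Script=Name} syntax, a deliberately short hypothetical list of candidate scripts, and the rule that an input of exactly one code point is taken as a literal character while anything longer is read as hexadecimal.

```typescript
// Hypothetical short list; a real tool would cover every script it supports.
const candidateScripts = [
  "Latin", "Greek", "Cyrillic", "Han", "Hiragana", "Katakana", "Common", "Inherited",
];

// Return the Script property value of a character, given either the character
// itself or the hexadecimal form of its code point (for example "3B1").
function lookupScript(input: string): string | undefined {
  let ch = input;
  // Assumption: exactly one code point means a literal character.
  if (!new RegExp("^.$", "su").test(input)) {
    if (!/^[0-9A-Fa-f]{1,6}$/.test(input)) return undefined;
    const codePoint = parseInt(input, 16);
    if (codePoint > 0x10ffff) return undefined;
    ch = String.fromCodePoint(codePoint);
  }
  // Test the character against each candidate's base Script property.
  for (let i = 0; i < candidateScripts.length; i++) {
    const token = new RegExp("^\\p{Script=" + candidateScripts[i] + "}$", "u");
    if (token.test(ch)) return candidateScripts[i];
  }
  return undefined; // not in the candidate list, or not a valid input
}

console.log(lookupScript("ж"));    // Cyrillic
console.log(lookupScript("3B1"));  // Greek (U+03B1 is α)
console.log(lookupScript("30FC")); // Common, the base Script value
```

Like the lookup described above, this sketch reports the base Script value, so U+30FC comes back as Common even though its script extensions include Hiragana and Katakana.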