Insert a Unicode Grapheme Cluster Break Value

A Unicode grapheme cluster break value is one of the many Unicode property tokens that you can insert via the Insert Token button on the Create panel.

Every Unicode code point has exactly one value for the Grapheme_Cluster_Break property. This property is defined in Unicode Standard Annex #29 (UAX #29), “Unicode Text Segmentation”, and is used to determine the boundaries between grapheme clusters. Such a boundary is called a grapheme cluster break. The property alone does not determine where the breaks are. Rather, the rules in UAX #29 look at the Grapheme_Cluster_Break values of the characters before and after a position in the text to decide whether there is a grapheme cluster break at that position.
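To make the relationship between property values and rules concrete, here is a minimal Java sketch of a few of the UAX 29 rules (GB3, GB4, GB5, GB9, and GB999). It is an illustration only, not a conforming implementation. The class GcbRuleSketch, the reduced Gcb enum, and the classify() helper are hypothetical names, and classify() merely approximates the real property by treating nonspacing and enclosing marks as Extend. The actual Grapheme_Cluster_Break property has more values and the full rule set is longer.

```java
final class GcbRuleSketch {
    // Hypothetical, reduced set of Grapheme_Cluster_Break values (assumption for illustration).
    enum Gcb { CR, LF, CONTROL, EXTEND, OTHER }

    // Rough approximation of the property lookup; the real data comes from the
    // Unicode Character Database, not from the general category.
    static Gcb classify(int codePoint) {
        if (codePoint == '\r') return Gcb.CR;
        if (codePoint == '\n') return Gcb.LF;
        int type = Character.getType(codePoint);
        if (type == Character.CONTROL) return Gcb.CONTROL;
        if (type == Character.NON_SPACING_MARK || type == Character.ENCLOSING_MARK) {
            return Gcb.EXTEND;
        }
        return Gcb.OTHER;
    }

    static boolean isControlLike(Gcb value) {
        return value == Gcb.CR || value == Gcb.LF || value == Gcb.CONTROL;
    }

    // Decides whether there is a break between two adjacent code points by
    // applying the rules in order; the first rule that applies wins.
    static boolean isBreak(int before, int after) {
        Gcb b = classify(before);
        Gcb a = classify(after);
        if (b == Gcb.CR && a == Gcb.LF) return false; // GB3: no break inside CR LF
        if (isControlLike(b)) return true;            // GB4: break after controls
        if (isControlLike(a)) return true;            // GB5: break before controls
        if (a == Gcb.EXTEND) return false;            // GB9: no break before Extend
        return true;                                  // GB999: otherwise break
    }

    public static void main(String[] args) {
        System.out.println(isBreak('\r', '\n'));  // CR followed by LF
        System.out.println(isBreak('e', 0x0301)); // letter followed by U+0301
        System.out.println(isBreak('a', 'b'));    // two ordinary letters
    }
}
```

Note that neither code point decides the outcome on its own: the same Extend character yields no break after a letter (GB9) but a break after a carriage return, because GB4 is checked first.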

You are unlikely to need to match this property with a regular expression. You could use it to implement the rules in UAX 29 using regular expressions. But most regex flavors that support this property also support \b{gcb} or \b{g} to match an actual grapheme cluster break according to UAX 29. You can insert one of those with the Grapheme Cluster Boundary item in the Anchor submenu of the Insert Token menu.
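As an illustration of the boundary anchor, the following sketch uses Java, one flavor that supports \b{g} (in java.util.regex since Java 9). The class name BreakPositions and its find() method are hypothetical names chosen for this example. The method collects every index at which the regex engine reports a grapheme cluster boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BreakPositions {
    // \b{g} is a zero-width anchor that matches at grapheme cluster boundaries.
    private static final Pattern BOUNDARY = Pattern.compile("\\b{g}");

    // Returns the char indexes in the text at which a boundary was found.
    static List<Integer> find(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = BOUNDARY.matcher(text);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
        System.out.println(find("e\u0301x"));
        // A CR LF pair between two letters.
        System.out.println(find("a\r\nb"));
    }
}
```

In the first string there is no boundary between the letter and its combining accent, and in the second there is none between the carriage return and the line feed.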

If you want to match a grapheme, you can use the Any Unicode Grapheme item in the Character submenu of the Insert Token menu to insert \X or an equivalent token that directly matches a Unicode grapheme. However, your application may have its own ideas of what a grapheme should be rather than follow the rules of UAX 29.
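For comparison, here is a short sketch of matching whole graphemes with \X, again in Java (which supports \X since Java 9). The class name GraphemeList and its split() method are hypothetical. Each match of \X is one extended grapheme cluster as the regex engine defines it, which may differ from what your application considers a grapheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeList {
    // \X matches one Unicode extended grapheme cluster.
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    // Splits the text into the grapheme clusters matched by \X, in order.
    static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher matcher = GRAPHEME.matcher(text);
        while (matcher.find()) {
            clusters.add(matcher.group());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x":
        // three chars in the string, but fewer graphemes.
        List<String> clusters = split("e\u0301x");
        System.out.println(clusters.size());
    }
}
```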