doc.go

/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people usually refer to as "characters"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken onto the next
line when the line is not wide enough to fit the entire text. Finally, you can
use it to calculate the display width of a string for monospace fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). These provide you with all information: grapheme clusters,
word boundaries, sentence boundaries, line breaks, and monospace character
widths. The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't,
either: the flag has 4 Unicode code points, thus 4 runes. Both
utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) return 4.

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package will allow you
to split strings into their grapheme clusters.
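
The byte and rune counts quoted above can be checked with the standard library
alone. The following is a minimal sketch, not part of this package; the names
rainbowFlag and byteAndRuneCount are made up for illustration, and the emoji is
spelled out with escapes so that its four code points are visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag (hypothetical name) spells out the emoji's four code points:
// U+1F3F3 (white flag), U+FE0F (variation selector-16),
// U+200D (zero width joiner), and U+1F308 (rainbow).
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// byteAndRuneCount (hypothetical helper) returns the UTF-8 byte length of s
// and the number of code points it contains.
func byteAndRuneCount(s string) (byteCount, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneCount(rainbowFlag)
	fmt.Println(b, r) // prints "14 4"
}
```

Neither number matches the single character a reader sees, which is the gap
that grapheme cluster segmentation closes.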

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides methods for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides methods for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window or other display area. This package provides methods to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose. However, there is no standard for the calculation of such widths,
and this package differs from wcswidth() in a number of ways, presumably to
generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.
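
To make the per-code-point rules above more concrete, here is a rough,
standard-library-only sketch. It is not this package's implementation, and the
function name approxRuneWidth is invented for illustration. It rests on several
loudly stated assumptions: Control, CR, and LF are approximated by Unicode
category Cc, Extend is approximated by categories Mn and Me, and the
"Fullwidth" and "Wide" properties are approximated by three hand-picked ranges
rather than the full East-Asian Width table. Extended Pictographic code points
and all cluster-level rules (Hangul Jamo, flags, variation selectors) are left
out entirely.

```go
package main

import (
	"fmt"
	"unicode"
)

// approxRuneWidth (hypothetical helper) estimates the monospace width of a
// single code point using a small, incomplete subset of the rules listed
// above. Anything not matched falls back to the default width of 1.
func approxRuneWidth(r rune) int {
	switch {
	case r == 0x2E3A: // Two-Em Dash
		return 3
	case r == 0x2E3B: // Three-Em Dash
		return 4
	case unicode.IsControl(r), r == 0x200D:
		// Assumption: category Cc stands in for Control, CR, and LF.
		// U+200D is the zero width joiner (ZWJ).
		return 0
	case unicode.In(r, unicode.Mn, unicode.Me):
		// Assumption: nonspacing and enclosing marks stand in for Extend.
		return 0
	case r >= 0x4E00 && r <= 0x9FFF, // CJK Unified Ideographs (Wide)
		r >= 0xAC00 && r <= 0xD7A3, // Hangul Syllables (Wide)
		r >= 0xFF01 && r <= 0xFF60: // Fullwidth ASCII variants (Fullwidth)
		return 2
	case r >= 0x1F1E6 && r <= 0x1F1FF: // Regional Indicator symbols
		return 2
	}
	return 1
}

func main() {
	samples := []rune{'a', '\n', 0x0301, 0x200D, 0x2E3A, 0x2E3B, '世', 0x1F1E9}
	for _, r := range samples {
		fmt.Printf("U+%04X -> %d\n", r, approxRuneWidth(r))
	}
}
```

Summing these values over a string only gives a sensible result when every
grapheme cluster consists of a single code point; for real text, [StringWidth]
applies the complete rule set, including the cluster-level adjustments.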

Note that whether these widths appear correct depends on your application's
render engine, the extent to which it conforms to the Unicode Standard, and its
choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg