libgrapheme.sh (6346B)
cat << EOF
.Dd ${MAN_DATE}
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification in regard to character, word, sentence and
line segmentation and case detection and conversion.
.Pp
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte-level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
Additionally, it is a
.Dq freestanding
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library.
This makes it easy to use in bare metal environments.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_is_lowercase 3 ,
.Xr grapheme_is_lowercase_utf8 3 ,
.Xr grapheme_is_titlecase 3 ,
.Xr grapheme_is_titlecase_utf8 3 ,
.Xr grapheme_is_uppercase 3 ,
.Xr grapheme_is_uppercase_utf8 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_word_break_utf8 3 ,
.Xr grapheme_to_lowercase 3 ,
.Xr grapheme_to_lowercase_utf8 3 ,
.Xr grapheme_to_titlecase 3 ,
.Xr grapheme_to_titlecase_utf8 3 ,
.Xr grapheme_to_uppercase 3 ,
.Xr grapheme_to_uppercase_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode ${UNICODE_VERSION} specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII for instance, which comprises the
range 0 to 127, assigns the number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express
all writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given that no ASCII codepoint has been
assigned to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) was facing a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side-note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence (which can also have the length 1) of codepoints that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adopt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt dev@frign.de
EOF