libgrapheme

Freestanding C library for unicode string handling
git clone https://git.sinitax.com/suckless/libgrapheme

commit 59952de9863572fbca88c3f9f1292709d381407b
parent fdfcc49755f22074116fd52765bea1a60d3539ba
Author: Laslo Hunhold <dev@frign.de>
Date:   Sat, 18 Dec 2021 13:24:30 +0100

Consistently refer to "codepoints" as "codepoints", not "code points"

Both are valid forms and Unicode prefers the latter, but maybe it's
because I'm a German speaker (known for ridiculous compound words like
"Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamter")
and like compound words that I prefer the former.

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diffstat:
M gen/util.c                       |  2 +-
M man/grapheme_character_isbreak.3 |  6 +++---
M man/grapheme_utf8_decode.3       |  8 ++++----
M man/grapheme_utf8_encode.3       |  4 ++--
M man/libgrapheme.7                | 34 +++++++++++++++++-----------------
M src/character.c                  |  4 ++--
M src/utf8.c                       | 10 +++++-----
M test/utf8-decode.c               |  2 +-
M test/utf8-encode.c               | 14 +++++++-------
9 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/gen/util.c b/gen/util.c
@@ -325,7 +325,7 @@ segment_test_callback(char *fname, char **field, size_t nfields, char *comment,
 				return 1;
 			}
 		} else {
-			/* add code point to cp-array */
+			/* add codepoint to cp-array */
 			if ((t->cp = realloc(t->cp, ++t->cplen *
 			                     sizeof(*t->cp))) == NULL) {
 				fprintf(stderr, "segment_test_callback: realloc: %s.\n", strerror(errno));
diff --git a/man/grapheme_character_isbreak.3 b/man/grapheme_character_isbreak.3
@@ -3,7 +3,7 @@
 .Os suckless.org
 .Sh NAME
 .Nm grapheme_character_isbreak
-.Nd test for a grapheme cluster break between two code points
+.Nd test for a grapheme cluster break between two codepoints
 .Sh SYNOPSIS
 .In grapheme.h
 .Ft size_t
@@ -13,7 +13,7 @@ The
 .Fn grapheme_character_isbreak
 function determines if there is a grapheme cluster break (see
 .Xr libgrapheme 7 )
-between the two code points
+between the two codepoints
 .Va cp1
 and
 .Va cp2 .
@@ -33,7 +33,7 @@ The
 .Fn grapheme_character_isbreak
 function returns
 .Va true
-if there is a grapheme cluster break between the code points
+if there is a grapheme cluster break between the codepoints
 .Va cp1
 and
 .Va cp2
diff --git a/man/grapheme_utf8_decode.3 b/man/grapheme_utf8_decode.3
@@ -3,7 +3,7 @@
 .Os suckless.org
 .Sh NAME
 .Nm grapheme_utf8_decode
-.Nd decode first code point in UTF-8-encoded string
+.Nd decode first codepoint in UTF-8-encoded string
 .Sh SYNOPSIS
 .In grapheme.h
 .Ft size_t
@@ -11,20 +11,20 @@
 .Sh DESCRIPTION
 The
 .Fn grapheme_utf8_decode
-function decodes the next code point in the UTF-8-encoded string
+function decodes the next codepoint in the UTF-8-encoded string
 .Va str
 of length
 .Va len .
 If the UTF-8-sequence is invalid (overlong encoding, unexpected byte,
 string ends unexpectedly, empty string, etc.) the decoding is stopped
-at the last processed byte and the decoded code point set to
+at the last processed byte and the decoded codepoint set to
 .Dv GRAPHEME_INVALID_CODE_POINT.
 .Pp
 If
 .Va cp
 is not
 .Dv NULL
-the decoded code point is stored in the memory pointed to by
+the decoded codepoint is stored in the memory pointed to by
 .Va cp .
 .Pp
 Given NUL has a unique 1 byte representation, it is safe to operate on
diff --git a/man/grapheme_utf8_encode.3 b/man/grapheme_utf8_encode.3
@@ -3,7 +3,7 @@
 .Os suckless.org
 .Sh NAME
 .Nm grapheme_utf8_encode
-.Nd encode code point into UTF-8 string
+.Nd encode codepoint into UTF-8 string
 .Sh SYNOPSIS
 .In grapheme.h
 .Ft size_t
@@ -11,7 +11,7 @@
 .Sh DESCRIPTION
 The
 .Fn grapheme_utf8_encode
-function encodes the code point
+function encodes the codepoint
 .Va cp
 into a UTF-8-string.
 If
diff --git a/man/libgrapheme.7 b/man/libgrapheme.7
@@ -29,25 +29,25 @@ making up a written language). ASCII for instance, which comprises the
 range 0 to 127, assigns the number 65 (0x41) to the abstract character
 .Sq A .
 This number is called a
-.Dq code point ,
-and all code points of an encoding make up its so-called
+.Dq codepoint ,
+and all codepoints of an encoding make up its so-called
 .Dq code space .
 .Pp
 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
-first 128 code points are identical to ASCII's. The additional code
+first 128 codepoints are identical to ASCII's. The additional code
 points are needed as Unicode's goal is to express all writing systems
 of the world. To give an example, the abstract character
 .Sq \[u00C4]
-is not expressable in ASCII, given no ASCII code point has been assigned
-to it. It can be expressed in Unicode, though, with the code point 196
+is not expressable in ASCII, given no ASCII codepoint has been assigned
+to it. It can be expressed in Unicode, though, with the codepoint 196
 (0xC4).
 .Pp
 One may assume that this process is straightfoward, but as more and
-more code points were assigned to abstract characters, the Unicode
+more codepoints were assigned to abstract characters, the Unicode
 Consortium (that defines the Unicode standard) was facing a problem:
 Many (mostly non-European) languages have such a large amount of abstract
 characters that it would exhaust the available Unicode code
-space if one tried to assign a code point to each abstract character. The
+space if one tried to assign a codepoint to each abstract character. The
 solution to that problem is best introduced with an example: Consider
 the abstract character
 .Sq \[u01DE] ,
@@ -63,9 +63,9 @@ of the
 .Dq base character
 .Sq A .
 .Pp
-The Unicode Consortium adapted this idea by assigning code points to
-modifications. For example, the code point 0x308 represents adding an
-umlaut and 0x304 represents adding a macron, and thus, the code point
+The Unicode Consortium adapted this idea by assigning codepoints to
+modifications. For example, the codepoint 0x308 represents adding an
+umlaut and 0x304 represents adding a macron, and thus, the codepoint
 sequence
 .Dq 0x41 0x308 0x304 ,
 namely the base character
@@ -73,15 +73,15 @@ namely the base character
 followed by the umlaut and macron modifiers, represents the abstract
 character
 .Sq \[u01DE] .
-As a side-note, the single code point 0x1DE was also assigned to
+As a side-note, the single codepoint 0x1DE was also assigned to
 .Sq \[u01DE] ,
 which is a good example for the fact that there can be multiple
 representations of a single abstract character in Unicode.
 .Pp
-Expressing a single abstract character with multiple code points solved
+Expressing a single abstract character with multiple codepoints solved
 the code space exhaustion-problem, and the concept has been greatly
 expanded since its first introduction (emojis, joiners, etc.). A sequence
-(which can also have the length 1) of code points that belong together
+(which can also have the length 1) of codepoints that belong together
 this way and represents an abstract character is called a
 .Dq grapheme cluster .
 .Pp
@@ -89,12 +89,12 @@ In many applications it is necessary to count the number of
 user-perceived characters, i.e. grapheme clusters, in a string. A good
 example for this is a terminal text editor, which needs to properly align
 characters on a grid. This is pretty simple with ASCII-strings, where you
-just count the number of bytes (as each byte is a code point and each
-code point is a grapheme cluster). With Unicode-strings, it is a common
+just count the number of bytes (as each byte is a codepoint and each
+codepoint is a grapheme cluster). With Unicode-strings, it is a common
 mistake to simply adapt the ASCII-approach and count the number of code
 points. This is wrong, as, for example, the sequence
 .Dq 0x41 0x308 0x304 ,
-while made up of 3 code points, is a single grapheme cluster and
+while made up of 3 codepoints, is a single grapheme cluster and
 represents the user-perceived character
 .Sq \[u01DE] .
 .Pp
@@ -102,7 +102,7 @@ The proper way to segment a string into user-perceived characters is to
 segment it into its grapheme clusters by applying the Unicode grapheme
 cluster breaking algorithm (UAX #29). It is based on a complex ruleset
 and lookup-tables and determines if a grapheme cluster ends or
-is continued between two code points. Libraries like ICU, which also
+is continued between two codepoints. Libraries like ICU, which also
 offer this functionality, are often bloated, not correct, difficult to
 use or not statically linkable. The motivation behind
 .Nm
diff --git a/src/character.c b/src/character.c
@@ -201,14 +201,14 @@ grapheme_character_nextbreak(const char *str)
 	 * the null byte for the reasons given above.
 	 */
 
-	/* get first code point */
+	/* get first codepoint */
 	len += grapheme_utf8_decode(str, (size_t)-1, &cp0);
 	if (cp0 == GRAPHEME_INVALID_CODE_POINT) {
 		return len;
 	}
 
 	while (cp0 != 0) {
-		/* get next code point */
+		/* get next codepoint */
 		ret = grapheme_utf8_decode(str + len, (size_t)-1, &cp1);
 
 		if (cp1 == GRAPHEME_INVALID_CODE_POINT ||
diff --git a/src/utf8.c b/src/utf8.c
@@ -10,8 +10,8 @@
 static const struct {
 	uint8_t lower; /* lower bound of sequence first byte */
 	uint8_t upper; /* upper bound of sequence first byte */
-	uint_least32_t mincp; /* smallest non-overlong encoded code point */
-	uint_least32_t maxcp; /* largest encodable code point */
+	uint_least32_t mincp; /* smallest non-overlong encoded codepoint */
+	uint_least32_t maxcp; /* largest encodable codepoint */
 	/*
 	 * implicit: table-offset represents the number of following
 	 * bytes of the form 10xxxxxx (6 bits capacity each)
@@ -129,7 +129,7 @@ grapheme_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
 			return 1 + (i - 1);
 		}
 		/*
-		 * shift code point by 6 bits and add the 6 stored bits
+		 * shift codepoint by 6 bits and add the 6 stored bits
 		 * in s[i] to it using the bitmask 0x3F (00111111)
 		 */
 		*cp = (*cp << 6) | (((const unsigned char *)s)[i] & 0x3F);
@@ -139,7 +139,7 @@ grapheme_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
 	    BETWEEN(*cp, UINT32_C(0xD800), UINT32_C(0xDFFF)) ||
 	    *cp > UINT32_C(0x10FFFF)) {
 		/*
-		 * code point is overlong encoded in the sequence, is a
+		 * codepoint is overlong encoded in the sequence, is a
 		 * high or low UTF-16 surrogate half (0xD800..0xDFFF) or
 		 * not representable in UTF-16 (>0x10FFFF) (RFC-3629
 		 * specifies the latter two conditions)
@@ -158,7 +158,7 @@ grapheme_utf8_encode(uint_least32_t cp, char *s, size_t n)
 	if (BETWEEN(cp, UINT32_C(0xD800), UINT32_C(0xDFFF)) ||
 	    cp > UINT32_C(0x10FFFF)) {
 		/*
-		 * code point is a high or low UTF-16 surrogate half
+		 * codepoint is a high or low UTF-16 surrogate half
 		 * (0xD800..0xDFFF) or not representable in UTF-16
 		 * (>0x10FFFF), which RFC-3629 deems invalid for UTF-8.
 		 */
diff --git a/test/utf8-decode.c b/test/utf8-decode.c
@@ -11,7 +11,7 @@ static const struct {
 	char *arr; /* UTF-8 byte sequence */
 	size_t len; /* length of UTF-8 byte sequence */
 	size_t exp_len; /* expected length returned */
-	uint_least32_t exp_cp; /* expected code point returned */
+	uint_least32_t exp_cp; /* expected codepoint returned */
 } dec_test[] = {
 	{
 		/* empty sequence
diff --git a/test/utf8-encode.c b/test/utf8-encode.c
@@ -8,42 +8,42 @@
 #include "util.h"
 
 static const struct {
-	uint_least32_t cp; /* input code point */
+	uint_least32_t cp; /* input codepoint */
 	char *exp_arr; /* expected UTF-8 byte sequence */
 	size_t exp_len; /* expected length of UTF-8 sequence */
 } enc_test[] = {
 	{
-		/* invalid code point (UTF-16 surrogate half) */
+		/* invalid codepoint (UTF-16 surrogate half) */
 		.cp = UINT32_C(0xD800),
 		.exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
 		.exp_len = 3,
 	},
 	{
-		/* invalid code point (UTF-16-unrepresentable) */
+		/* invalid codepoint (UTF-16-unrepresentable) */
 		.cp = UINT32_C(0x110000),
 		.exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
 		.exp_len = 3,
 	},
 	{
-		/* code point encoded to a 1-byte sequence */
+		/* codepoint encoded to a 1-byte sequence */
 		.cp = 0x01,
 		.exp_arr = (char *)(unsigned char[]){ 0x01 },
 		.exp_len = 1,
 	},
 	{
-		/* code point encoded to a 2-byte sequence */
+		/* codepoint encoded to a 2-byte sequence */
 		.cp = 0xFF,
 		.exp_arr = (char *)(unsigned char[]){ 0xC3, 0xBF },
 		.exp_len = 2,
 	},
 	{
-		/* code point encoded to a 3-byte sequence */
+		/* codepoint encoded to a 3-byte sequence */
 		.cp = 0xFFF,
 		.exp_arr = (char *)(unsigned char[]){ 0xE0, 0xBF, 0xBF },
 		.exp_len = 3,
 	},
 	{
-		/* code point encoded to a 4-byte sequence */
+		/* codepoint encoded to a 4-byte sequence */
 		.cp = UINT32_C(0xFFFFF),
 		.exp_arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF, 0xBF },
 		.exp_len = 4,
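
The expected byte sequences in the test/utf8-encode.c table can be reproduced with a small standalone encoder. The following is only a sketch and not the code in src/utf8.c: the name sketch_utf8_encode, the unsigned char buffer type and the behaviour for a too-small buffer (report the needed length, write nothing) are assumptions made for illustration. What it does take from the diff is the rule that surrogate halves (0xD800..0xDFFF) and values above 0x10FFFF are invalid, and the test table's expectation that they encode as EF BF BD (U+FFFD).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libgrapheme): encode the codepoint
 * cp as UTF-8 into s (capacity n) and return the number of bytes the
 * encoding needs.
 */
static size_t
sketch_utf8_encode(uint_least32_t cp, unsigned char *s, size_t n)
{
	/* lead-byte prefixes, indexed by sequence length (0 is unused) */
	static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
	size_t len, i;

	/* surrogate halves and values beyond 0x10FFFF become U+FFFD */
	if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
		cp = 0xFFFD;
	}

	/* pick the shortest sequence length that can hold cp */
	len = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;

	/* assumption of this sketch: only report the size if s is too small */
	if (s == NULL || n < len) {
		return len;
	}

	/* continuation bytes 10xxxxxx, filled from the last byte backwards */
	for (i = len - 1; i > 0; i--) {
		s[i] = (unsigned char)(0x80 | (cp & 0x3F));
		cp >>= 6;
	}

	/* lead byte: length prefix plus the remaining high bits of cp */
	s[0] = (unsigned char)(lead[len] | cp);

	return len;
}
```

Feeding it the codepoints from the table (0x01, 0xFF, 0xFFF, 0xFFFFF, 0xD800, 0x110000) should yield the listed byte sequences and lengths, which makes it easy to check a new table entry by hand before adding it.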