libgrapheme

Freestanding C library for Unicode string handling
git clone https://git.sinitax.com/suckless/libgrapheme

commit 42e58c7d3a921540f5d901b80a0cc75e234b02e9
parent c7021f101ce95bff58157fb32c50d204cf8569b2
Author: Laslo Hunhold <dev@frign.de>
Date:   Wed, 22 Dec 2021 15:20:27 +0100

Add a remark on standard conformance in README

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diffstat:
M README                              | 20 ++++++++++++++++++++
M man/grapheme_decode_utf8.3          |  2 +-
M man/grapheme_encode_utf8.3          |  2 +-
M man/grapheme_is_character_break.3   |  2 +-
M man/grapheme_next_character_break.3 |  2 +-
M man/libgrapheme.7                   | 13 ++++++++++++-
6 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/README b/README
@@ -7,6 +7,13 @@ up of user-perceived characters (so-called "grapheme clusters") that are
 made up of one or more Unicode codepoints, which in turn are encoded in
 one or more bytes in an encoding like UTF-8.
 
+There is a widespread misconception that it was enough to simply
+determine codepoints in a string and treat them as user-perceived
+characters to be Unicode compliant. While this may work in some cases,
+this assumption quickly breaks, especially for non-Western languages and
+decomposed Unicode strings where user-perceived characters are usually
+represented using multiple codepoints.
+
 Despite the complicated multilevel structure of Unicode strings,
 libgrapheme provides methods to work with them at the byte-level (i.e.
 UTF-8 ‘char’ arrays) while also providing codepoint-level methods.
@@ -28,6 +35,19 @@ Afterwards enter the following command to build and install libgrapheme
 
 	make install
 
+Conformance
+-----------
+The libgrapheme library is compliant with the Unicode 14.0.0
+specification (September 2021).
+
+To ensure conformance, libgrapheme includes hundreds of tests including
+all provided with the standard-provided test-data that is parsed
+automatically. The tests can be run with
+
+	make test
+
+to check standard conformance.
+
 Usage
 -----
 Include the header grapheme.h in your code and link against libgrapheme

diff --git a/man/grapheme_decode_utf8.3 b/man/grapheme_decode_utf8.3
@@ -1,4 +1,4 @@
-.Dd 2021-12-19
+.Dd 2021-12-22
 .Dt GRAPHEME_DECODE_UTF8 3
 .Os suckless.org
 .Sh NAME

diff --git a/man/grapheme_encode_utf8.3 b/man/grapheme_encode_utf8.3
@@ -1,4 +1,4 @@
-.Dd 2021-12-17
+.Dd 2021-12-22
 .Dt GRAPHEME_ENCODE_UTF8 3
 .Os suckless.org
 .Sh NAME

diff --git a/man/grapheme_is_character_break.3 b/man/grapheme_is_character_break.3
@@ -1,4 +1,4 @@
-.Dd 2021-12-18
+.Dd 2021-12-22
 .Dt GRAPHEME_IS_CHARACTER_BREAK 3
 .Os suckless.org
 .Sh NAME

diff --git a/man/grapheme_next_character_break.3 b/man/grapheme_next_character_break.3
@@ -1,4 +1,4 @@
-.Dd 2021-12-18
+.Dd 2021-12-22
 .Dt GRAPHEME_NEXT_CHARACTER_BREAK 3
 .Os suckless.org
 .Sh NAME

diff --git a/man/libgrapheme.7 b/man/libgrapheme.7
@@ -1,4 +1,4 @@
-.Dd 2021-12-19
+.Dd 2021-12-22
 .Dt LIBGRAPHEME 7
 .Os suckless.org
 .Sh NAME
@@ -18,11 +18,22 @@ see
 that are made up of one or more Unicode codepoints, which in turn are
 encoded in one or more bytes in an encoding like UTF-8.
 .Pp
+There is a widespread misconception that it was enough to simply
+determine codepoints in a string and treat them as user-perceived
+characters to be Unicode compliant.
+While this may work in some cases, this assumption quickly breaks,
+especially for non-Western languages and decomposed Unicode strings
+where user-perceived characters are usually represented using multiple
+codepoints.
+.Pp
 Despite this complicated multilevel structure of Unicode strings,
 .Nm
 provides methods to work with them at the byte-level (i.e. UTF-8
 .Sq char
 arrays) while also offering codepoint-level methods.
+.Pp
+Every documented function's manual page provides a self-contained
+example illustrating the possible usage.
 .Sh SEE ALSO
 .Xr grapheme_decode_utf8 3 ,
 .Xr grapheme_encode_utf8 3 ,
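
The misconception described in the added README paragraph can be
illustrated with a small sketch. It deliberately does not use
libgrapheme; the helper name count_codepoints is hypothetical and is
not part of the library's API. The sketch assumes valid UTF-8 input and
counts codepoints by skipping continuation bytes, which is exactly the
naive "one codepoint is one character" approach the README warns about.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libgrapheme): count the codepoints
 * in a UTF-8 buffer, assuming the input is valid UTF-8.
 * Every byte that is not a continuation byte (bit pattern 10xxxxxx)
 * starts a new codepoint.
 */
static size_t
count_codepoints(const char *str, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (((unsigned char)str[i] & 0xC0) != 0x80) {
			n++;
		}
	}

	return n;
}

/* precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE */
static const char precomposed[] = "\xC3\xA9";

/* decomposed form: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT */
static const char decomposed[] = "e\xCC\x81";
```

Both strings are shown to the user as a single character, yet
count_codepoints(precomposed, 2) yields 1 while
count_codepoints(decomposed, 3) yields 2. Treating codepoints as
characters therefore gives different answers for canonically equivalent
text, which is the gap that grapheme cluster segmentation closes.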