libgrapheme

Freestanding C library for Unicode string handling
git clone https://git.sinitax.com/suckless/libgrapheme

libgrapheme.sh (6346B)


cat << EOF
.Dd ${MAN_DATE}
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd Unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification with regard to character, word, sentence and
line segmentation and case detection and conversion.
.Pp
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte-level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
Additionally, it is a
.Dq freestanding
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library.
This makes it easy to use in bare metal environments.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_is_lowercase 3 ,
.Xr grapheme_is_lowercase_utf8 3 ,
.Xr grapheme_is_titlecase 3 ,
.Xr grapheme_is_titlecase_utf8 3 ,
.Xr grapheme_is_uppercase 3 ,
.Xr grapheme_is_uppercase_utf8 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_word_break_utf8 3 ,
.Xr grapheme_to_lowercase 3 ,
.Xr grapheme_to_lowercase_utf8 3 ,
.Xr grapheme_to_titlecase 3 ,
.Xr grapheme_to_titlecase_utf8 3 ,
.Xr grapheme_to_uppercase 3 ,
.Xr grapheme_to_uppercase_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode ${UNICODE_VERSION} specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the range 0 to 127, assigns the
number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express all
writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) was facing a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adapted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side-note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence (which can also have the length 1) of codepoints that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adapt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup-tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt dev@frign.de
EOF