libgrapheme

Freestanding C library for Unicode string handling
git clone https://git.sinitax.com/suckless/libgrapheme

commit 5998352d2d2e6e37531548f8e986abae5ff8ef02
parent dd15fea026c3e0b389381ae8cc08e0f39fa1a8f7
Author: Laslo Hunhold <dev@frign.de>
Date:   Tue, 25 Oct 2022 13:20:47 +0200

Implement the Unicode Bidirectional Algorithm (UAX #9)

To be frank, I had never heard about this until I started learning more
about Unicode, but it is an absolute must for all languages that are
written from right to left (Hebrew, Arabic, Farsi, etc.) and for any
case where you mix RTL and LTR languages.

The Unicode Bidirectional Algorithm is the normative procedure you apply
to a string to obtain embedding levels, which can then be used to
reorder the string so that it is displayed in the proper reading
direction. The central aspect is that strings are always stored in
logical order in memory and only reordered for presentation on the
screen.
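
To illustrate how embedding levels drive the reordering, here is a
minimal sketch of rule L2 of UAX #9: from the highest level down to the
lowest odd level, every maximal run of codepoints at that level or
higher is reversed. The function names are hypothetical and not part of
the libgrapheme API, and the sketch assumes that all levels are
already resolved and non-negative.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the codepoints in cp[from..to] (inclusive) in place. */
static void
reverse_range(uint_least32_t *cp, size_t from, size_t to)
{
	while (from < to) {
		uint_least32_t tmp = cp[from];

		cp[from++] = cp[to];
		cp[to--] = tmp;
	}
}

/*
 * Hypothetical helper (not libgrapheme API): reorder one line of
 * codepoints for display, given one resolved embedding level per
 * codepoint. Assumes all levels are >= 0.
 */
void
reorder_line(uint_least32_t *cp, const int_least8_t *lev, size_t len)
{
	int_least8_t max = 0, min_odd = 127;
	size_t i, start;

	/* determine the highest level and the lowest odd level */
	for (i = 0; i < len; i++) {
		if (lev[i] > max) {
			max = lev[i];
		}
		if (lev[i] % 2 == 1 && lev[i] < min_odd) {
			min_odd = lev[i];
		}
	}

	/*
	 * from the highest level down to the lowest odd level,
	 * reverse every maximal run at that level or higher;
	 * the loop is skipped entirely if there is no odd level
	 */
	for (; max >= min_odd; max--) {
		for (i = 0; i < len;) {
			if (lev[i] < max) {
				i++;
				continue;
			}
			start = i;
			while (i < len && lev[i] >= max) {
				i++;
			}
			reverse_range(cp, start, i - 1);
		}
	}
}
```

With the levels 0,1,1,2,2,1 for the codepoints "abcdef", the level-2
run "de" is reversed first and then the whole run from 'b' to 'f',
which yields "afdecb".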

Currently, ICU and GNU fribidi are the established implementations of
the algorithm, and as usual it's pretty convoluted to use them. There
are many memory allocations, kitchen-sink-madness and legacy cruft, but
the demand is there (there's even a bidi-patch for dwm[0]).

What's special about this implementation? There are no memory
allocations at runtime. The user provides an array of 32-bit integers,
which is then filled with the embedding levels. The levels themselves
only range from -1 to 125 (by the standard!) and would fit in a signed
8-bit integer, but the algorithm naturally needs a scratchpad to store
processing data, which is what the remaining bits are for.
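
As a rough illustration of why one 32-bit slot per codepoint is enough,
the following sketch keeps the level in the low 8 bits and leaves the
upper 24 bits for scratch data. The layout and the function names are
assumptions made for this example, not the layout libgrapheme actually
uses, and the sketch works on an unsigned slot type to keep the bit
operations well-defined.

```c
#include <stdint.h>

/*
 * Hypothetical slot layout (an assumption for illustration):
 * bits 0..7 hold the level as a signed 8-bit value,
 * bits 8..31 hold scratch data used during processing.
 */
uint_least32_t
pack_slot(int_least8_t level, uint_least32_t scratch)
{
	return ((scratch & 0xFFFFFF) << 8) |
	       ((uint_least32_t)level & 0xFF);
}

/* Extract the signed level from the low 8 bits of a slot. */
int_least8_t
slot_level(uint_least32_t slot)
{
	uint_least32_t b = slot & 0xFF;

	/* values 0x80..0xFF represent the negative range -128..-1 */
	return (b >= 0x80) ? (int_least8_t)((int)b - 0x100)
	                   : (int_least8_t)b;
}

/* Extract the 24 bits of scratch data from a slot. */
uint_least32_t
slot_scratch(uint_least32_t slot)
{
	return (slot >> 8) & 0xFFFFFF;
}
```

Once processing is finished, only the level part of each slot is of
interest to the caller, so the scratch bits can simply be discarded.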

A complication of the algorithm is that, at some point, you have to
break the paragraph into lines, and the line breaks affect the level
determination. GNU fribidi and ICU make this very complicated and hard
to understand. The API is not final as you see it here, but the final
process will consist of the following steps (each number corresponding
to a function):

	1) "preprocessing" the string up to the part where the algorithm
	   does not depend on the line breaks
	2) determining line embedding levels for a line
	   (by specifying the preprocessed data buffer and an output
	   level-buffer)
	3) reordering a line (by specifying the preprocessed data buffer
	   and an output string that is allowed to be the input string)

Conformance is obviously a high priority: There are literally over a
million conformance tests for the bidirectional algorithm, split across
the files BidiTest.txt and BidiCharacterTest.txt, which are
automatically parsed into the header gen/bidirectional-test.h.

Currently, only BidiTest.txt is used for tests (all of which pass),
because bracket pairs have not been implemented yet. Bracket pairs and
(maybe) Arabic shaping are what is left to be implemented, but this
commit is already a big step.

One more note: Yes, the data files are very large, but they compress
down very well and the tarball stays below 800K. It's very important
to me that there's no need to pull any data from the web for compilation
or testing for obvious reasons.

[0]:https://dwm.suckless.org/patches/bidi/

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diff is too large, output suppressed.