utf8proc

A clean C library for processing UTF-8 Unicode data
git clone https://git.sinitax.com/juliastrings/utf8proc

commit e41bd981cbc2242b5e44da0bef48fd0f57065fed
parent 0d7224a6d8a77e5eebf5e18bded742490f3b20fd
Author: Steven G. Johnson <stevenj@mit.edu>
Date:   Tue, 15 Jul 2014 21:50:23 -0400

markdown fixes, prettified NEWS

Diffstat:
D Changelog  | 131 -------------------------------------------------------------------------------
M LICENSE.md |   6 +++---
A NEWS.md    | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M README.md  |  12 ++++++------
4 files changed, 151 insertions(+), 140 deletions(-)

diff --git a/Changelog b/Changelog
@@ -1,131 +0,0 @@
-Changelog
-
-2006-06-02:
-- initial release of version 0.1
-
-2006-06-05:
-- changed behaviour of PostgreSQL function to return NULL in case of
-  invalid input, rather than raising an exceptional condition
-- improved efficiency of PostgreSQL function (no transformation to C string
-  is done)
-
-2006-06-20:
-- added -fpic compiler flag in Makefile
-- fixed bug in the C code for the ruby library (usage of non-existent
-  function)
-
-Release of version 0.2
-
-
-2006-07-18:
-- changed normalization from NFC to NFKC for postgresql unifold function
-
-2006-08-04:
-- added support to mark the beginning of a grapheme cluster with 0xFF
-  (option: CHARBOUND)
-- added the ruby method String#chars, which is returning an array of UTF-8
-  encoded grapheme clusters
-- added NLF2LF transformation in postgresql unifold function
-- added the DECOMPOSE option, if you neither use COMPOSE or DECOMPOSE, no
-  normalization will be performed (different from previous versions)
-- using integer constants rather than C-strings for character properties
-- fixed (hopefully) a problem with the ruby library on Mac OS X, which
-  occured when compiler optimization was switched on
-
-Release of version 0.3
-
-
-2006-09-17:
-- added the LUMP option, which lumps certain characters together
-  (see lump.txt) (also used for the PostgreSQL "unifold" function)
-- added the STRIPMARK option, which strips marking characters
-  (or marks of composed characters)
-- deprecated ruby method String#char_ary in favour of String#utf8chars
-
-Release of version 1.0
-
-
-2006-09-20:
-- included a gem file for the ruby version of the library
-
-Release of version 1.0.1
-
-
-2006-09-21:
-- included a check in Integer#utf8, which raises an exception, if the given
-  code-point is invalid because of being too high (this was missing yet)
-
-2006-12-26:
-- added support for PostgreSQL version 8.2
-
-Release of version 1.0.2
-
-
-2007-03-16:
-- Fixed a bug in the ruby library, which caused an error, when splitting an
-  empty string at grapheme cluster boundaries (method String#utf8chars).
-
-Release of version 1.0.3
-
-
-2007-06-25:
-- Added a new PostgreSQL function 'unistrip', which behaves like 'unifold',
-  but also removes all character marks (e.g. accents).
-
-2007-07-22:
-- Changed license from BSD to MIT style.
-- Added a new function 'utf8proc_codepoint_valid' to the C library.
-- Changed compiler flags in Makefile from -g -O0 to -O2
-- The ruby script, which was used to build the utf8proc_data.c file, is now
-  included in the distribution.
-
-Release of version 1.1.1
-
-
-2007-07-25:
-- Fixed a serious bug in the data file generator, which caused characters
-  being treated incorrectly, when stripping default ignorable characters or
-  calculating grapheme cluster boundaries.
-
-Release of version 1.1.2
-
-
-2008-10-04:
-- Added a function utf8proc_version returning a string containing the version
-  number of the library.
-- Included a target libutf8proc.dylib for MacOSX.
-
-2009-05-01:
-- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)
-
-Release of version 1.1.3
-
-
-2009-06-14:
-- replaced C++ style comments for compatibility reasons
-- added typecasts to suppress compiler warnings
-- removed redundant source files for ruby-gemfile generation
-
-2009-08-19:
-- Changed copyright notice for Public Software Group e. V.
-- Minor changes in the README file
-- Release of version 1.1.4
-
-2009-08-20:
-- Use RSTRING_PTR() and RSTRING_LEN() instead of RSTRING()->ptr and
-  RSTRING()->len for ruby1.9 compatibility (and #define them, if not
-  existent)
-
-2009-10-02:
-- Patches for compatibility with Microsoft Visual Studio
-
-2009-10-08:
-- Fixes to make utf8proc usable in C++ programs
-
-2009-10-16:
-- Release of version 1.1.5
-
-2013-11-27:
-- PostgreSQL 9.2 and 9.3 compatibility (lowercase 'c' language name)
-- Release of version 1.1.6
-
diff --git a/LICENSE.md b/LICENSE.md
@@ -1,4 +1,4 @@
-== libutf8proc license ==
+## libutf8proc license ##
 
 **libutf8proc** is a lightly updated version of the **utf8proc**
 library by Jan Behrens and the rest of the Public Software Group, who
@@ -27,7 +27,7 @@ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 DEALINGS IN THE SOFTWARE.
 
-== Original utf8proc license ==
+## Original utf8proc license ##
 
 *Copyright (c) 2009, 2013 Public Software Group e. V., Berlin, Germany*
 
@@ -49,7 +49,7 @@ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 DEALINGS IN THE SOFTWARE.
 
-== Unicode data license ==
+## Unicode data license ##
 
 This software distribution contains derived data from a modified version of
 the Unicode data files. The following license applies to that data:
diff --git a/NEWS.md b/NEWS.md
@@ -0,0 +1,142 @@
+# libutf8proc release history #
+
+No releases so far.
+
+# utf8proc release history #
+
+## Version 1.1.6 ##
+
+2013-11-27:
+
+- PostgreSQL 9.2 and 9.3 compatibility (lowercase `c` language name)
+
+## Version 1.1.5 ##
+
+2009-08-20:
+
+- Use `RSTRING_PTR()` and `RSTRING_LEN()` instead of `RSTRING()->ptr` and
+  `RSTRING()->len` for ruby1.9 compatibility (and `#define` them, if not
+  existent)
+
+2009-10-02:
+
+- Patches for compatibility with Microsoft Visual Studio
+
+2009-10-08:
+
+- Fixes to make utf8proc usable in C++ programs
+
+2009-10-16:
+
+## Version 1.1.4 ##
+
+2009-06-14:
+
+- replaced C++ style comments for compatibility reasons
+- added typecasts to suppress compiler warnings
+- removed redundant source files for ruby-gemfile generation
+
+2009-08-19:
+
+- Changed copyright notice for Public Software Group e. V.
+- Minor changes in the `README` file
+
+## Version 1.1.3 ##
+
+2008-10-04:
+
+- Added a function `utf8proc_version` returning a string containing the version
+  number of the library.
+- Included a target `libutf8proc.dylib` for MacOSX.
+
+2009-05-01:
+- PostgreSQL 8.3 compatibility (use of `SET_VARSIZE` macro)
+
+## Version 1.1.2 ##
+
+2007-07-25:
+
+- Fixed a serious bug in the data file generator, which caused characters
+  being treated incorrectly, when stripping default ignorable characters or
+  calculating grapheme cluster boundaries.
+
+## Version 1.1.1 ##
+
+2007-06-25:
+
+- Added a new PostgreSQL function `unistrip`, which behaves like `unifold`,
+  but also removes all character marks (e.g. accents).
+
+2007-07-22:
+
+- Changed license from BSD to MIT style.
+- Added a new function `utf8proc_codepoint_valid` to the C library.
+- Changed compiler flags in `Makefile` from `-g -O0` to `-O2`
+- The ruby script, which was used to build the `utf8proc_data.c` file, is now
+  included in the distribution.
+
+## Version 1.0.3 ##
+
+2007-03-16:
+
+- Fixed a bug in the ruby library, which caused an error, when splitting an
+  empty string at grapheme cluster boundaries (method `String#utf8chars`).
+
+## Version 1.0.2 ##
+
+2006-09-21:
+
+- included a check in `Integer#utf8`, which raises an exception, if the given
+  code-point is invalid because of being too high (this was missing yet)
+
+2006-12-26:
+
+- added support for PostgreSQL version 8.2
+
+## Version 1.0.1 ##
+
+2006-09-20:
+
+- included a gem file for the ruby version of the library
+
+Release of version 1.0.1
+
+## Version 1.0 ##
+
+2006-09-17:
+
+- added the `LUMP` option, which lumps certain characters together (see `lump.txt`) (also used for the PostgreSQL `unifold` function)
+- added the `STRIPMARK` option, which strips marking characters (or marks of composed characters)
+- deprecated ruby method `String#char_ary` in favour of `String#utf8chars`
+
+## Version 0.3 ##
+
+2006-07-18:
+
+- changed normalization from NFC to NFKC for postgresql unifold function
+
+2006-08-04:
+
+- added support to mark the beginning of a grapheme cluster with 0xFF (option: `CHARBOUND`)
+- added the ruby method `String#chars`, which is returning an array of UTF-8 encoded grapheme clusters
+- added `NLF2LF` transformation in postgresql `unifold` function
+- added the `DECOMPOSE` option, if you neither use `COMPOSE` or `DECOMPOSE`, no normalization will be performed (different from previous versions)
+- using integer constants rather than C-strings for character properties
+- fixed (hopefully) a problem with the ruby library on Mac OS X, which occured when compiler optimization was switched on
+
+## Version 0.2 ##
+
+2006-06-05:
+
+- changed behaviour of PostgreSQL function to return NULL in case of invalid input, rather than raising an exceptional condition
+- improved efficiency of PostgreSQL function (no transformation to C string is done)
+
+2006-06-20:
+
+- added -fpic compiler flag in Makefile
+- fixed bug in the C code for the ruby library (usage of non-existent function)
+
+## Version 0.1 ##
+
+2006-06-02: initial release of version 0.1
+
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-== libutf8proc ==
+# libutf8proc #
 
 The [libutf8proc package](https://github.com/JuliaLang/libutf8proc) is a
 lightly updated fork of the [utf8proc
@@ -28,11 +28,11 @@ data governed by the similarly permissive [Unicode data
 license](http://www.unicode.org/copyright.html#Exhibit1)); please see
 the included `LICENSE.md` file for more detailed information.
 
-=== Quick Start ===
+## Quick Start ##
 
 For compilation of the C library run `make`.
 
-=== General Information ===
+## General Information ##
 
 The C library is found in this directory after successful compilation
 and is named `libutf8proc.a` (for the static library) and
@@ -49,19 +49,19 @@ For Unicode normalizations, the following options are used:
 * Normalization Form KC: `STABLE`, `COMPOSE`, `COMPAT`
 * Normalization Form KD: `STABLE`, `DECOMPOSE`, `COMPAT`
 
-=== C Library ===
+## C Library ##
 
 The documentation for the C library is found in the `utf8proc.h` header file.
 `utf8proc_map` is function you will most likely be using for mapping UTF-8
 strings, unless you want to allocate memory yourself.
 
-=== To Do ===
+## To Do ##
 
 * detect stable code points and process segments independently in order to save memory
 * do a quick check before normalizing strings to optimize speed
 * support stream processing
 
-=== Contact ===
+## Contact ##
 
 Bug reports, feature requests, and other queries can be filed at
 the [libutf8proc page on Github](https://github.com/JuliaLang/libutf8proc).