TIP: 389
Title: Full support for Unicode 8.0 and later
Version: $Revision: 1.4 $
Authors: Jan Nijtmans <jan dot nijtmans at users dot sf dot net>
         Jan Nijtmans <jan dot nijtmans at gmail dot com>
State: Draft
Type: Project
Tcl-Version: 8.7
Vote: Pending
Created: Tuesday, 23 August 2011
Discussions To: Tcl Core list
Keywords: Tcl
This TIP proposes adding full support for all characters in Unicode 8.0 and later, including the characters >= U+010000.
Once the character range extends beyond 16 bits, the type Tcl_UniChar is no longer big enough to hold every possible character. Changing Tcl_UniChar to a 32-bit quantity is not an option, as it would result in a binary API incompatibility.
The solution proposed in this TIP is to keep Tcl_UniChar a 16-bit quantity, but to increase the value of TCL_UTF_MAX to 4 (from 3). Any conversions from UTF-8 to Tcl_UniChar will convert any valid 4-byte UTF-8 sequence to a sequence of two Surrogate characters. All conversions from UTF-16 to UTF-8 will make sure that any High Surrogate immediately followed by a Low Surrogate will result in a single 4-byte UTF-8 character.
This can be done in a binary compatible way: no source code of existing extensions needs to be modified. As long as no characters >= U+010000 or Surrogates are used, all functions provided by the Tcl library will function as before. A few functions currently return a value of type Tcl_UniChar; those will be modified to return an int instead.
As Unicode 8.0 and future Unicode versions supply more and more characters outside the 16-bit range, it would be useful for Tcl to support them as well.
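The mapping between a character >= U+010000 and its pair of Surrogates, which the conversions described above rely on, can be sketched as follows. This is a minimal illustration of the standard UTF-16 arithmetic, not code from the reference implementation; the helper names split_surrogates and join_surrogates are hypothetical and are not part of the Tcl API.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the Tcl API. */

/* Split a code point in the range U+010000..U+10FFFF into a
 * High Surrogate and a Low Surrogate.
 * Returns 1 on success, 0 if cp is outside that range. */
static int split_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return 0;
    }
    cp -= 0x10000;                            /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 | (cp >> 10));  /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* lower 10 bits */
    return 1;
}

/* Combine a High Surrogate and a Low Surrogate into one code point.
 * The caller is assumed to have checked that both values are Surrogates. */
static uint32_t join_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high & 0x3FF)) << 10) + (uint32_t)(low & 0x3FF);
}
```

For example, U+01F600 splits into the High Surrogate 0xD83D and the Low Surrogate 0xDE00, and joining those two values gives U+01F600 again.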
This document proposes:
Change the functions Tcl_UniCharToUtf and UnicodeToUtfProc such that when they are given a valid High Surrogate followed by a Low Surrogate, the result will be a single 4-byte UTF-8 character.
Change the functions Tcl_UtfToUniChar and UtfToUnicodeProc such that when they are given a valid 4-byte UTF-8 character, the first call will return a High Surrogate character, and the next call will return a Low Surrogate character.
The following functions, which currently return a Tcl_UniChar, will be changed to return an int instead:
Tcl_UniCharAtIndex
Tcl_UniCharToLower
Tcl_UniCharToTitle
Tcl_UniCharToUpper
Tcl_GetUniChar
Extend tclUniData.c to include all Unicode 8.0 characters up to U+02FA20. A special case will be made for the functions Tcl_UniCharIsGraph and Tcl_UniCharIsPrint for the characters in the range U+0E0100 - U+0E01EF, otherwise it would almost double the Unicode table size.
As long as no Surrogates or characters >= U+010000 are used, all functions behave exactly the same as before. The only way that Tcl_UniCharToUtf can produce a 4-byte output is when Surrogates or characters >= U+010000 are used.
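The 4-byte output case can be sketched as follows. This is an illustration of the encoding step only, under the assumption that both Surrogates are already available to the caller; the function name pair_to_utf8 is hypothetical, and the sketch does not reproduce how Tcl_UniCharToUtf itself receives or buffers its input.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the Tcl API.
 *
 * Encode a High Surrogate followed by a Low Surrogate as a single
 * 4-byte UTF-8 character. Returns the number of bytes written (4),
 * or 0 if the two values do not form a valid Surrogate pair. */
static int pair_to_utf8(uint16_t high, uint16_t low, unsigned char out[4])
{
    uint32_t cp;

    if ((high & 0xFC00) != 0xD800 || (low & 0xFC00) != 0xDC00) {
        return 0;
    }
    cp = 0x10000 + (((uint32_t)(high & 0x3FF)) << 10) + (uint32_t)(low & 0x3FF);

    out[0] = (unsigned char)(0xF0 | (cp >> 18));          /* 11110xxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F)); /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 4;
}
```

With the pair 0xD83D, 0xDE00 (U+01F600) this writes the bytes F0 9F 98 80. Any input that is not a High Surrogate followed by a Low Surrogate is rejected here, which mirrors the statement above that only Surrogates or characters >= U+010000 can lead to a 4-byte result.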
Extensions that want to be compatible with any Tcl version can include tcl.h as follows:
    #define TCL_UTF_MAX 4
    #include <tcl.h>
or they can call the C compiler with the additional argument -DTCL_UTF_MAX=4, in order to be sure that UTF-8 representations of length 4 can be handled. This way, the extension can be used with any Tcl version, whether it supports Surrogates or not.
Apart from this, it is advisable to initialize the variable that the chPtr argument of Tcl_UtfToUniChar points to, as this location is used to remember whether the High Surrogate has already been produced. If it is left uninitialized and the first character of a string is a character >= U+010000, the call might return only a Low Surrogate character. This danger, however unlikely, exists only for the first character in a string, and only when the (random) value happens to be exactly equal to the expected High Surrogate.
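Why the initial value matters can be seen in a small stand-in for the two-call behavior. This is a sketch under stated assumptions, not the reference implementation: the function name sketch_utf4_to_unichar is hypothetical, it handles only well-formed 4-byte sequences, and the choice to consume 0 bytes on the first call and 4 on the second is an assumption of this sketch that the TIP text does not specify.

```c
#include <stdint.h>

/* Hypothetical stand-in for illustration only; not part of the Tcl API.
 *
 * src must point at a well-formed 4-byte UTF-8 sequence. *chPtr doubles
 * as the memory of whether the High Surrogate was already produced.
 * Returns the number of bytes consumed (an assumption of this sketch:
 * 0 after producing the High Surrogate, 4 after the Low Surrogate). */
static int sketch_utf4_to_unichar(const unsigned char *src, uint16_t *chPtr)
{
    uint32_t cp = ((uint32_t)(src[0] & 0x07) << 18)
                | ((uint32_t)(src[1] & 0x3F) << 12)
                | ((uint32_t)(src[2] & 0x3F) << 6)
                |  (uint32_t)(src[3] & 0x3F);
    uint16_t high = (uint16_t)(0xD800 | ((cp - 0x10000) >> 10));
    uint16_t low  = (uint16_t)(0xDC00 | ((cp - 0x10000) & 0x3FF));

    if (*chPtr == high) {
        /* The High Surrogate appears to have been produced already. */
        *chPtr = low;
        return 4;
    }
    *chPtr = high;
    return 0;
}
```

Starting from a variable initialized to 0, two calls on the bytes F0 9F 98 80 yield 0xD83D and then 0xDE00. If the variable instead happens to hold 0xD83D before the first call, that call yields 0xDE00 straight away and the High Surrogate is never seen, which is the situation the paragraph above warns about.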
A reference implementation is available at http://core.tcl.tk/tcl in branch tip-389-impl
This document has been placed in the public domain.