TIP #413: UNICODE SUPPORT FOR 'STRING IS SPACE' AND 'STRING TRIM' =================================================================== Version: $Revision: 1.5 $ Author: Jan Nijtmans State: Final Type: Project Tcl-Version: 8.6 Vote: Done Created: Monday, 08 October 2012 URL: https://tip.tcl-lang.org413.html Discussions-To: Tcl Core list Post-History: ------------------------------------------------------------------------- ABSTRACT ========== This TIP is in fact a re-consideration of [TIP #318], in that it attempts to define, once and for all, for which characters *string is space* should return 1 and which characters *string trim* should trim. RATIONALE =========== Intuitively, *string is space* and *string trim* should treat the same characters as space, but currently that's not the case, even after the implementation of [TIP #318]. The unicode standard advanced to version 6.2 now (at the time of this writing), but also Java and .NET have their own views on what whitespace should be. Let's try to learn from them. DEFINING THE TCL SPACE SET ---------------------------- The NUL character has the function as string separator, which could be considered in the same group as LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). It's a very useful character to be stripped. It even had the Whitespace property in Unicode 2.0. The problem with considering this character as space is that its visible representation is not specified, it even should not occur in normal text. Therefore, it is not in the "Tcl space set", but it is very useful to let it be stripped by *string trim*. The Unicode standard changed in time, which resulted in whitespace characters being removed (deprecated) and added. The /String.Trim()/ method in .NET 3.5 stripped zero width space (U+200B) and zero width no-break space (U+FEFF) from strings, but later .NET versions don't do that any more. The "Tcl space set" should not depend on that: If characters are deprecated in future Unicode versions, and because of that Whitespace properties are changed, they will not be removed from the "Tcl space set". But if new whitespace characters are added in future Unicode standards, they will be added to the "Tcl space set" as well, influencing both *string is space* and *string trim*. The 3 characters that are in the "Tcl space set" but not in the current Unicode whitespace set are discussed now. Most obvious is zero width no-break space (U+FEFF), which is a very useful character to be stripped, as it is used now as Byte Order Mark (BOM). It has no visible representation, and - in fact - no meaning at all within Tcl, as Tcl is UTF-8 internally already. It should not occur anywhere else in the string, but in the past it could as being a zero-width no-break space. It had the /White_Space/ property in Unicode 2.0, but later versions of Unicode do not; the use of the BOM as a space was deprecated. When the use as space was deprecated for (U+FEFF), another character was put forward as replacement for it: word joiner (U+2060). As this character has no visible representation, and has no meaning at all when at the start or the end of a string, it makes sense to include it in the "Tcl space set" as well, the more because its predecessor had the /White_Space/ property. Finally, zero width space (U+200B), had the /White_Space/ property in Unicode 3.0. In the current Unicode Charts it is still listed as being a space, even though the White_Space property was removed later. Therefore it should be in the "Tcl space set" as well. SPECIFICATION =============== This document proposes: * For the ASCII set, *string is space* stays as is. *string trim* will be modified to trim all characters for which *string is space* returns 1, augmented with the NUL character. This means that NUL, VT and FF will be added to the set. This is a *potential incompatibility*. * For characters outside ASCII, the Unicode *White_Space* [] property forms the basis of what *string is space* and *string trim* consider being space. But 3 characters are added to the set: zero with space (U+200B), word joiner (U+2060) and zero width no-break space (U+FEFF) (i.e., the BOM). The *string trimleft* and *string trimright* commands will also be modified, as they track *string trim*. COMPATIBILITY =============== For the ASCII set, the only change is the addition of 3 characters to *string trim*. For Unicode there are more changes, but all added characters are either rarely used, either intuitively expected to be trimmed by *string trim*. I don't think that any code will be adversely affected by this change, it will probably fix more bugs than that it breaks any existing code. ALTERNATIVES ============== 1. NUL could be added to *string is space*, but that would be in conflict with what POSIX /isspace()/ function does. 2. NUL could be left out of the *string trim* set. 3. Additional characters I considered being part of the set: break permitted here (U+0082) no break here (U+0083) zero width joiner (U+200C) zero with non-joiner (U+200D) Those are clearly useful characters to be stripped, as they have no meaning and no visible appearance at the beginning or end of a string. But they are not spaces, so it would diverge the two commands. REFERENCE IMPLEMENTATION ========================== A reference implementation is available in the Tcl fossil repository on the /tip-318-update/ branch []. COPYRIGHT =========== This document has been placed in the public domain. ------------------------------------------------------------------------- TIP AutoGenerator - written by Donal K. Fellows