Like most programming languages, Tcl comes with facilities for handling many different sorts of data. We’ve already seen Tcl’s facilities for handling numbers and basic arithmetic. In this chapter we will examine all the other sorts of data that can be used in a Tcl program: strings, lists, dictionaries, and associative arrays, and how they can be put to work. We’ll also look at how you can use these basic data ‘types’ to build up more sophisticated data structures suited to your application.
Unlike most programming languages, Tcl does not have a built-in notion of ‘type’. More precisely, all values in Tcl are of a single type: strings. In practice, most programming with data in Tcl is much like programming in other languages: you decide on how some data is to be interpreted (e.g., as a number, or a list) and then use commands that are appropriate for that sort of data. Tcl will take care of making sure that efficient representations are used under-the-hood (e.g., integers are represented as native machine integers).
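As a short interactive sketch of this idea (the values here are purely illustrative):

% set x 0x2A          ;# looks like a hexadecimal integer
0x2A
% string length $x    ;# but it is still just a four-character string
4
% expr {$x + 1}       ;# and it can be treated as a number when needed
43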
Tcl’s approach to data structures is to provide a few general purpose structures and to then optimise the implementation for typical uses. This is in contrast to lower-level languages, such as C++ or Java, that provide lots of different data structures for different purposes. Tcl instead aims to simplify the task of programming by picking a few powerful and general data structures. In this regard, Tcl is similar to high-level applications like spreadsheets or relational database systems, that allow the user to specify what they want to achieve and leave it up to the implementation to ensure that most operations are efficient. Of course, you can always design your own custom data structures in Tcl if you need to, but the defaults provided are sufficient for most tasks. In addition, Tcl takes care of memory management and index checking for you, and so is safe from whole classes of errors due to memory mismanagement or ‘buffer overflows’.1
Tcl commands often have subcommands. We’ve already seen an example in the info command. The string command is another example; it contains subcommands for manipulating strings. While all values in Tcl are strings, the string command should be used only on data that you really want to be treated as a string, such as names of people or addresses. A command with subcommands is known as an “ensemble” command. Some useful string commands include:
string length "Hello, World!"
returns 13.
% string index "Hello, World!" 0
H
% string index "Hello, World!" 5
,
You can also index strings from the end using the following syntax:
% string index "Hello, World!" end
!
% string index "Hello, World!" end-2
l
% string range "Hello, World!" 2 end-2
llo, Worl
% string equal -nocase -length 5 "Hello, World!" "hElLo Tcl!"
1
% string compare "Hello" "World"
-1
% string compare -nocase apple Apple
0
% string first "ll" "Hello, World!"
2
You can also specify an optional start index for the search:
% string first "ll" "Hello, World!" 5
-1
% string last "ll" "Hello, World!"
2
% string last l "Hello, World!"
10
% set str "Hello, World!"
Hello, World!
% set idx 2
2
% string range $str [string wordstart $str $idx] \
    [string wordend $str $idx]-1
Hello
% string reverse "Hello, World!"
!dlroW ,olleH
% string repeat "Hello! " 5
Hello! Hello! Hello! Hello! Hello!
% string replace "Hello, World!" 0 4 Goodbye
Goodbye, World!
% string map {0 "zero " 1 "one " 2 "two "} 01201201202
zero one two zero one two zero one two zero two
% string trim "\tHello, World! "
Hello, World!
% string trimleft "Hello, World!" "lHe!"
o, World!
Note that the second argument is considered as a set of characters rather than a string.
""
) as valid
unless the -strict option is given. This command is mostly useful
in Tk GUIs (Part II).
A number of other commands are useful for manipulating strings, but are not in the string ensemble:
% set str "Hello,"
Hello,
% append str " World!"
Hello, World!
% set str
Hello, World!
Note that we do not use a dollar sign when passing the variable str to this command. This is because the command expects the name of a variable containing a string, rather than the string itself. The general rule is that if a command operates on a value then you pass it the value (using a dollar sign), but if it manipulates a variable then you pass the variable’s name without a dollar sign.
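A quick illustration of the difference (the values are just examples):

% set str "Hello"
Hello
% append str ", World!"   ;# append takes a variable name: no dollar sign
Hello, World!
% string length $str      ;# string length takes a value: use the dollar sign
13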
% concat "Hello," "World!"
Hello, World!
# Parse a simple colour specification of the form #RRGGBB in hex format
% scan #08D03F "#%2x%2x%2x" r g b
3
% puts "$r $g $b"
8 208 63
# Format the string back again
% format "#%02X%02X%02X" $r $g $b
#08D03F
These two commands are very useful for input validation and output formatting, and we will demonstrate their usage throughout the text. In particular, you should almost always use scan and format when processing numeric input and output.
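For instance, a minimal sketch of validating integer input with scan (the variable names here are purely illustrative):

set input "  42"
# scan returns the number of successful conversions, so 1 means we found an integer
if {[scan $input %d value] == 1} {
    puts "Got the integer $value"
} else {
    puts "\"$input\" does not begin with an integer"
}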
Tcl was one of the first programming languages to adopt Unicode-aware strings throughout the language. To this day, Tcl still has one of the most mature and well-integrated Unicode string implementations, roughly on a par with that of Java. Tcl strings are represented internally as a slightly modified version of UTF-8. Data entering or exiting the Tcl system is translated to and from UTF-8 largely transparently. (The details of how to control this conversion are discussed in Chapter 5). What this means for the Tcl programmer is that they can concentrate on the logic of their application, safe in the knowledge that they can handle input in most languages and character sets in use in the world. All standard Tcl commands that deal with strings (i.e., pretty much all of them!) can handle Tcl’s UTF-8 format, as can all commands that deal with input and output. Most Tcl extensions in common use are also usually Unicode-aware, so, for instance, Tcl XML parsers do actually handle the Unicode requirements of XML, unlike some languages we might care to mention!
There are some current limitations to Tcl’s otherwise excellent Unicode support. Firstly, Tcl currently only supports Unicode code-points up to U+FFFF (the ‘Basic Multilingual Plane’, or BMP). Support for Unicode characters beyond this range is a work-in-progress. The remaining limitations are largely within the Tk graphical user interface extension: for instance, support for rendering bidirectional (‘bidi’) scripts is currently missing. Overall, though, Tcl’s support for Unicode and different character encodings is first class.
To enter arbitrary Unicode characters into a Tcl script, you can use the Unicode escape syntax described in Section 1.4.1:
set str "n\u00E4mlich"
This will result in $str containing the string nämlich. To manually convert between different encodings you can use the encoding command. Note: you should normally never have to use the encoding command directly, as all conversions are usually done by the I/O system automatically.
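Should you need it, a minimal sketch might look like the following (converting to and from ISO 8859-1, and the variable names bytes and str2, are just illustrative choices):

# Convert the Tcl string into a sequence of bytes in ISO 8859-1 (Latin-1)
set bytes [encoding convertto iso8859-1 $str]
# ... and convert those bytes back into a Tcl string
set str2 [encoding convertfrom iso8859-1 $bytes]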
One particularly useful string operation is the string match command that takes a pattern, known as a ‘glob’, and determines if a given string matches the pattern. ‘Globbing’ is the wild-card matching technique that most Unix shells use. In addition to string match, globbing is also used by the switch, lsearch and glob commands, discussed later. The wildcards that string match accepts are: * (which matches any sequence of zero or more characters), ? (which matches any single character), [chars] (which matches any single character in the given set or range) and \x (which matches the character x literally, even if it is a wildcard).
Some examples are as follows:
% string match f* foo
1
% string match f?? foo
1
% string match f foo
0
% string match {[a-z]*} foo
1
% string match {[a-z]*} Foo
0
The string match command is a great way of determining if some input string matches a particular format. If you have more complex pattern matching requirements, or need to simultaneously extract information as well as match it, then regular expressions (Section 2.6) provide a more powerful (but harder to use) facility.
Tcl has sophisticated facilities for dealing with dates and times in the form of the clock command, another ensemble. Tcl represents dates and times as the number of seconds since the epoch time of 1st January, 1970, 00:00 UTC. Note that Tcl’s clock times do not include leap seconds: each UTC day is considered to have exactly 86400 seconds. Instead, Tcl minutely adjusts its clock speed to account for leap seconds. The main clock-related commands are clock seconds, clock milliseconds and clock microseconds (which return the current time at second, millisecond and microsecond resolution), clock format (which converts a time in seconds into a human-readable string), clock scan (which parses a date/time string into a time in seconds), clock add (which performs date arithmetic) and clock clicks (a high-resolution timer, described below).
The format and scan sub-commands accept a -format option that specifies the expected format of the output/input. If not specified, clock format defaults to a format of ‘%a %b %d %H:%M:%S %z %Y’, while clock scan uses ‘free form scan’ mode, in which it attempts to guess the format of the given input string. The possible format specifiers are too numerous to list in this tutorial. Instead, we refer the reader to the official manual page [2]. The -locale and -timezone options can be used to specify the locale and timezone in which time values are processed. The -base option to clock scan specifies a base time that relative values are computed from. In older versions of Tcl you can use the -gmt 1 option to specify that all processing should be done in UTC. Since Tcl 8.5, this usage is deprecated in favour of -timezone :UTC. Some examples of the use of the clock command:
% clock format [clock seconds] -format "%Y-%m-%d %T"
2009-03-22 20:50:42
% clock format [clock scan "now + 1 year"]
Mon Mar 22 00:00:00 GMT 2010
The clock clicks command returns a system-dependent high-resolution timer. This timer is not guaranteed to be relative to any particular epoch, unlike the other clock commands, and so should only be used for relative timing when the highest-resolution timer that a system supports is needed (such as for benchmarks).
If you want to time how long a particular piece of Tcl code is taking to run (for profiling), then you can use the time command that is designed for just this purpose:
time script ?iterations?
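For example, a quick sketch of timing a snippet over many iterations (the reported figure is purely illustrative and will vary from machine to machine):

% time {string repeat "Hello, World! " 1000} 100
7.2 microseconds per iteration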
The list is a fundamental data structure in Tcl. A list is simply an ordered collection of elements: numbers, words, strings, or other lists. As Tcl is untyped, lists can contain elements of multiple different kinds. Even commands in Tcl are just lists, in which the first element is the name of the command to call, and subsequent elements are the arguments to that command. Tcl lists are implemented in a similar manner to arrays or vectors in languages such as C. That is, Tcl list elements are held in contiguous ranges of memory. This makes indexing and iterating over elements very efficient (O(1) indexing in ‘big-Oh notation’ [13]). Other operations, such as appending and deleting elements are also efficient for most usage patterns due to the way Tcl handles list memory.
Lists can be created in several ways:
set myList {"Item 1" "Item 2" "Item 3"}
Note that in this tutorial we often use braces to delimit lists and quotes to delimit other strings. This is just a convention: Tcl will accept either form for either a string or a list, so long as it is in the right format.
set myList [list "Item 1" "Item 2" "Item 3"]
set myList [split "Item 1,Item 2,Item 3" ","]
As for strings, there are a number of commands for accessing and manipulating elements of lists. Unlike string, these list commands are not part of a single ensemble, but are separate commands using an ‘l’ prefix, e.g., lindex, lrange etc.:2
% llength {"Item 1" "Item 2" "Item 3"}
3
% lindex {a b c d} 2
c
As for strings, you can also use the ‘end-n’ syntax to access elements from the end.
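For instance:

% lindex {a b c d} end
d
% lindex {a b c d} end-1
c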
% lrange {"Item 1" "Item 2" "Item 3"} 1 2
{Item 2} {Item 3}
Note that Tcl leaves off the outer braces when displaying lists, and prefers to use braces rather than quotes to delimit elements. Most Tcl list commands will ‘normalize’ lists in this way. Don’t be fooled though: Tcl always ensures that the elements of lists can be retrieved in exactly the format that they were created. In other words, the following law holds for all list operations:
[lindex [list $x] 0] == $x
You should always use the standard Tcl list commands for creating and dismantling lists to ensure that this law continues to hold. Don’t be tempted to use string commands to try and create strings that ‘look like lists’. While this can work, the details are much trickier than they appear at first. All Tcl’s list commands ensure that the lists they produce are always properly formed.
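As a brief sketch of why this matters, consider an element that happens to contain a space:

% set x "Hello, World!"
Hello, World!
% set good [list $x]    ;# list adds the braces needed to keep x as one element
{Hello, World!}
% lindex $good 0
Hello, World!
% lindex $x 0           ;# treating the raw string as a list splits at the space
Hello,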
% linsert {b c d} 0 a
a b c d
% linsert {a b c} end d
a b c d
% lreplace {a b c d} 1 2 "Bee" "Cee"
a Bee Cee d
% lreplace {a b c d} 1 2
a d
% lreverse {a b c d}
d c b a
% lrepeat 3 a b c
a b c a b c a b c
% join {a b c d} ", "
a, b, c, d
Note that join and split aren’t complete inverses. In particular, it is not guaranteed that joining a list by a certain delimiter and then splitting that list on the same delimiter will result in the original list. Consider this example:
% set x [list a b "c,d" e]
a b c,d e
% set xstr [join $x ,]
a,b,c,d,e
% split $xstr ,
a b c d e
% set xs {a b c d}
a b c d
% foreach x $xs { puts $x }
a
b
c
d
A number of commands operate on variables containing lists, rather than directly on lists, for reasons of efficiency and convenience, much like the append command for strings. These commands also follow the l-prefix convention:
% set xs [list a b]
a b
% lappend xs c d
a b c d
% set xs
a b c d
% set xs [list a b c d]
a b c d
% lset xs 1 Bee
a Bee c d
% set xs [list a b c d]
a b c d
% lassign $xs first second
c d
% puts "$first, $second"
a, b
As well as string and number elements, Tcl lists can also contain other lists as elements. Some Tcl commands, such as lindex, can automatically extract elements from nested lists in a single operation. For instance, we might represent a person as a list of three elements: a name, an address, and a date of birth. The address field is itself a list of strings. For example:
% set jon [list "Jon Doe" [list "A House" "Somewhere"] "1-Jan-1970"]
{Jon Doe} {{A House} Somewhere} 1-Jan-1970
We can then retrieve the first line of Jon’s address using lindex and specifying multiple indices. The indices act like a ‘path’ through the nested list: the first index specifies where to look in the outer list, then the next index where to look in that list, and so on. In this case, the address is the second element of the jon list (i.e., element at index 1), and the first line is then element 0 of the address list, so the complete path is 1 0:
% lindex $jon 1 0
A House
The lset command also recognises nested lists, and can be used to alter variables containing them. For instance, if we wanted to change Jon’s house name, we can use:
% lset jon 1 0 "Jon's House"
{Jon Doe} {{Jon's House} Somewhere} 1-Jan-1970
% lindex $jon 1 0
Jon's House
Tcl builds in powerful facilities for sorting and searching lists of elements, in the form of the lsort and lsearch commands:
lsort ?-option value ...? list
lsearch ?-option value ...? list pattern
% lsort {b c e g f a d}
a b c d e f g
The comparison function can be changed by specifying one of the following options: -ascii (the default, comparing elements as strings), -dictionary (like -ascii, but case-insensitive and comparing embedded numbers numerically), -integer, -real, or -command cmd (which uses a custom comparison command).
Tip: The -command option is very slow, as it has to call a Tcl command multiple times during the sort. It is often much faster to massage the list into a format where it can be sorted using one of the built-in comparison functions, typically by creating a collation key. The wiki article “Custom sorting” [3] provides a general solution to this problem, along with much useful discussion of performance optimisations.
% set people {{Jon 32} {Mary 24} {Mike 31} {Jill 20}}
{Jon 32} {Mary 24} {Mike 31} {Jill 20}
% lsort -integer -index 1 $people
{Jill 20} {Mary 24} {Mike 31} {Jon 32}
% set people {Jon 32 Mary 24 Mike 31 Jill 20}
Jon 32 Mary 24 Mike 31 Jill 20
% lsort -integer -index 1 -stride 2 $people
Jill 20 Mary 24 Mike 31 Jon 32
Note that the -unique option keeps only the last of any set of elements that compare as equal: for example, if both {1 a} and {1 b} are present, in that order, and elements are compared on their first field only, then only {1 b} will be present in the output.
As an example, we can retrieve a list of all commands defined in a Tcl interpreter, sort them, and then print them out, one per line, using the following ‘one-liner’:
puts [join [lsort [info commands]] \n]
As well as sorting lists, you can also search for elements within them using the lsearch command. Like lsort, this command comes with a somewhat bewildering array of options to control its operation. However, its basic usage is very straight-forward: given a list and a pattern to search for, it will return the index of the first element that matches the pattern, or -1 if it cannot be found. By default this uses ‘glob’-style matching, as described in Section 2.1.2:
% lsearch {foo bar jim} b*
1
You can control the way that matching is performed using further options: -exact, -glob (the default) and -regexp select the matching style, while options such as -nocase, -not, -all and -inline change how matches are found and reported:
% lsearch -inline {foo bar jim} b*
bar
The lsearch command also supports options for specifying the contents of each element (as for lsort), e.g., -ascii, -integer etc., as well as specifying the sort order of sorted lists, finding the nearest element to a given pattern (if it is not exactly matched), or searching within nested lists.
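For instance, combining -all with -inline returns every matching element rather than just the index of the first (a small sketch):

% lsearch -all {foo bar baz jim} b*
1 2
% lsearch -all -inline {foo bar baz jim} b*
bar baz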
Dictionaries are available since Tcl version 8.5.
While a list is an ordered collection of elements, a dictionary is an unordered3 mapping from keys to values. Other languages might refer to such data structures as maps or (hash-)tables. Rather than each element having a position and an integer index, as in a list, each element in a dictionary is given a string name. The elements of a Tcl dictionary can be any string value, allowing them to also be used in a similar manner to records or structures in typed languages. The format of a Tcl dictionary is that of a list with an even number of elements: the even-indexed elements (0, 2, 4, ...) correspond to the keys, and the odd-indexed elements to the values for those keys. For instance, we can represent a person in the following format:
set jon {
    name    "Jon Doe"
    address {{A House} {Somewhere}}
    dob     1-Jan-1970
}
Viewed as a list, the $jon variable contains 6 elements. Viewed as a dictionary, it contains 3 mappings for the keys name, address, and dob. As well as creating dictionaries using a literal string, you can also use the dict create constructor command, that works like the list command, but creates a dictionary. We can retrieve the value associated with a key using the dict get command:
% dict get $jon dob
1-Jan-1970
You can check whether a dictionary contains a mapping for a given key using the dict exists command:
% dict exists $jon dob
1
% dict exists $jon some-other-key
0
Both of these commands can take a series of keys (as separate arguments), allowing you to access elements in nested dictionaries, in a similar manner to that for nested lists. Some other useful dictionary commands are dict set and dict unset (which update a dictionary stored in a variable), dict replace and dict remove (which return a modified copy of a dictionary value), dict merge (which combines several dictionaries), dict keys, dict values and dict size, and dict for (which iterates over the key-value pairs):
% set jane [dict replace $jon name "Jane Doe" gender female]
name {Jane Doe} address {{A House} {Somewhere}} dob 1-Jan-1970 gender female
# Change Jon's date-of-birth
dict set jon dob 2-Jan-1970
set default {
    font-weight: normal
    font-shape:  roman
}
set user {
    font-family: sans-serif
}
set author {
    font-family: serif
    font-weight: bold
}
# Merge the styles, preferring author over user over defaults
set style [dict merge $default $user $author]
# Extract final values from the combined dictionary
set family [dict get $style font-family:]
set weight [dict get $style font-weight:]
set shape  [dict get $style font-shape:]
puts "Paragraph font: $family $shape $weight"
dict for {key value} $jon {
    puts "$key = $value"
}
In addition to these commands, the dict command also includes a number of convenience commands for manipulating common data in-place within a dictionary. For example, the dict incr command allows efficient in-place incrementing of an integer value stored within a dictionary. For example, we can print a summary of word counts within a text using the following code (ignoring punctuation for now):
set word_counts [dict create]
foreach word [split $text] {
    dict incr word_counts $word
}
# Display the results
dict for {word count} $word_counts {
    puts [format "%-30s : %d" $word $count]
}
You can sort a dictionary using lsort, but not with the -dictionary option! (The name clash is accidental). Instead, you can use the -stride and -index options, taking advantage of the fact that all dictionaries are also lists:
set sorted_words [lsort -integer -stride 2 -index 1 $word_counts]
The dict command comes with powerful functionality for filtering a dictionary value to select just those key-value pairs that match some criteria. The dict filter command supports three different forms of filtering:
key and value filtering returns those key-value pairs whose key or value (respectively) matches one of a set of glob patterns, similar to lsearch.
script filtering allows a script to be run on each key-value pair of the dictionary, and includes only those elements for which the script returns a true value.
For example, given our dictionary of word counts from the previous section, we can return the counts of all words beginning with ‘a’ using:
set a_words [dict filter $word_counts key a*]
Or we could return all words with a count in double figures:
set d_words [dict filter $word_counts value ??]
Finally, we could return all words with a count greater than 15 using:
set frequent [dict filter $word_counts script {key value} {
    expr {$value > 15}
}]
The dict command also includes a very convenient feature for updating the contents of a dictionary variable in-place using an arbitrary script. The dict update and dict with commands unpack a dictionary value into a set of local variables with the names and values given by the contents of the dictionary. A script can then be executed that manipulates these variables. At the end of the script, any changes are read back from the variables and the corresponding changes are made to the original dictionary. For example, if we consider our ‘jon’ dictionary from the start of Section 2.4:
set jon {
    name    "Jon Doe"
    address {{A House} {Somewhere}}
    dob     1-Jan-1970
}
We can use the dict with command to unpack this data and manipulate it more concisely than by using individual dict get and dict set commands:
dict with jon {
    puts "Name: $name"
    puts "Addr: [join $address {, }]"
    puts "DOB : $dob"
    # Change Jon's name
    set name "Other Jon"
}
puts "Jon's name is now: [dict get $jon name]"
The dict update command works in a similar fashion, except that it allows you to specify exactly which entries should become variables and also to specify what those variable names should be (rather than just using the key name). This approach can be used to make batch manipulations to a dictionary value, using the full range of Tcl commands for manipulating ordinary variables.
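A minimal sketch of dict update, unpacking two keys into local variables of our own choosing (the names n and d are arbitrary):

dict update jon name n dob d {
    puts "Name: $n, DOB: $d"
    set d 2-Jan-1970    ;# written back into the dictionary when the script ends
}
puts "New DOB: [dict get $jon dob]"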
Tcl’s arrays are similar in some respects to dictionaries: they are unordered mappings from string keys to some value.4 In contrast to dictionaries, however, arrays map from string keys to variables rather than values. This means that arrays are not themselves first-class values in Tcl and cannot be passed to and from procedures without some extra work. On the other hand, arrays provide a convenient syntax and can be updated in-place just like other variables. It is an error to have an array variable and a normal variable with the same name.
Array variables can be created and manipulated just like regular variables, but using the special array syntax, which consists of a normal variable name followed by a string key in parentheses. For example, to create an array variable named ‘jon’ containing name, address, and date-of-birth fields, we would write:
set jon(name)    "Jon Doe"
set jon(address) {"A House" "Somewhere"}
set jon(dob)     1-Jan-1970
The array set command can also be used to create an array in one go:
array set jon {
    name    "Jon Doe"
    address {"A House" Somewhere}
    dob     1-Jan-1970
}
Once an array has been created, we can use the array syntax to manipulate its elements using normal variable commands. An array entry can be used pretty much everywhere that a normal variable can be:
puts "Name: $jon(name)" set jon(name) "Other Jon" lappend jon(address) "Planet Earth"
The parray command can be used to display the contents of an array variable:
% parray jon
jon(address) = {A House} Somewhere {Planet Earth}
jon(dob)     = 1-Jan-1970
jon(name)    = Other Jon
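Because an array is not itself a value, the usual way to move its contents around (for example, to pass them to a procedure) is the array get and array set pair, which convert between an array and a flat key-value list. A small sketch (the variable names pairs and copy are just illustrative):

# Flatten the array into a key-value list (which is also usable as a dictionary)
set pairs [array get jon]
# Rebuild a fresh array from that list
array set copy $pairs
puts $copy(name)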
Given the apparent similarity between arrays and dictionaries, it may not seem obvious when you should use one or the other. In fact, the two are quite different things and are useful in different situations. A dictionary should be used when your primary objective is to create a data structure which is naturally modelled as a collection of named fields. Using a dictionary gives you all the advantages of a normal Tcl value: it can be passed to and from procedures and other commands, it can be sent over communication channels (covered in Chapter 5), and it can be easily inserted into other data structures. Tcl also automatically manages the memory for dictionaries, as for other values, so you do not need to worry about freeing up resources when you have finished with it. An array, on the other hand, is a complex stateful entity, and is best used when you want to model a long-lived stateful component in your application. For instance, it can be very useful to associate an array with a graphical user interface component, to hold current information on the state of the display and user interactions. Such information typically lasts a long time and is constantly changing in response to user inputs. A dictionary could also be used in this situation, but an array as a collection of variables brings advantages such as variable traces that really shine in these kinds of situations.
Regular expressions are a compact means of expressing complex patterns for matching strings and extracting information. If you are not already familiar with regular expressions from other languages, such as Perl, then they can be quite daunting at first, with their terse and cryptic syntax. However, once you have mastered the basics they soon become an indispensable tool that you will wonder how you ever managed without! This tutorial will give only a very brief introduction to the basic features of regular expressions and the support provided for them in Tcl. There are many tutorials and good books devoted to the subject available on the Web and in technical bookstores if you wish to learn more. Readers who have come across regular expressions in other languages will find that Tcl uses the familiar POSIX-style syntax for basic regular expressions. Note that Tcl does not use the popular Perl-Compatible Regular Expression (PCRE) syntax used in several modern programming languages, so some advanced features may be slightly different.5
Before getting into the technicalities of regular expressions (REs) in Tcl, it is worth pointing out a little of the theoretical origins and practical limitations. A regular expression is so-called because it is capable of matching a regular language. Such languages are relatively simple in the grand scheme of things: no popular computer programming language has a grammar that simple, for instance. As an example, no regular expression is capable of matching a string only if the braces within it are balanced (i.e., each open brace is paired with exactly one close brace). It is therefore good to be aware of the limits of regular expressions, and when a more appropriate technology should be used. A classic example is extracting information from XML or HTML documents: at first it can look as if a simple RE will do the trick, but it usually then becomes apparent that XML is more complicated than it looks. Soon you end up with a large and unwieldy RE that still doesn’t match every example. The solution is to use a proper XML parser. However, this is not to say that regular expressions are not a powerful tool. Firstly, almost all regular expression engines—and Tcl is no exception here—implement more than the basic regular expressions (BREs) of theory. Various extensions to REs allow writing quite complex patterns quite succinctly, making them a handy tool to have in your toolbox. Even when your problem requires more than a RE can provide, regular expressions can still help when building a more comprehensive parser. In terms of parsing theory, regular expressions are most useful for lexical analysis while more sophisticated tools (such as recursive descent parsers) are more suited to syntactic analysis.
Tcl supports regular expressions primarily through two commands:
regexp ?options? RE string ?matchVar ...?
regsub ?options? RE string subSpec ?varName?
Table 3: Basic regular expression syntax.
Regular expressions are similar to the glob-matching that was discussed in Section 2.1.2. The main difference is the way that sets of matched characters are handled. In globbing the only way to select sets of unknown text is the * symbol, which matches any quantity of any character. Regular expression matching is much more precise: instead of such ‘wild-card’ matching, you can say exactly which characters you wish to match and then apply a modifier to this pattern to say how many times you wish to match it. For example, the RE a* will match 0, 1 or any number of a characters in a sequence. Modifiers can also be applied to sub-patterns, not just individual characters, so that [abc]* will match strings of any length containing just the characters a, b, or c (e.g., ‘bbacbaababc’). Another way of writing this pattern would be to use a character range: [a-c]*, which again matches strings of any length (including empty strings) consisting of just characters between a and c in the Unicode alphabet. You can also create negated patterns which match anything that isn’t in a certain set of characters, for example: [^a-c] will match any character that isn’t a, b, or c. As elsewhere in Tcl, a backslash character can be used to escape any of this special syntax to include literal asterisks and so on in a pattern. For instance, to match any number of dots, we can use \.*. Here are some examples of regular expressions:
# Match an Internet Protocol (IPv4) address in dotted form:
set re(ip) {([0-9]{1,3}\.){3}[0-9]{1,3}}
# Example:
regexp $re(ip) 192.168.0.1                 ;# returns 1 (i.e., matched)

# Match a Web URL
set re(web) {^https?://([^:/]+)(:[0-9]+)?(/.*)?$}
regexp $re(web) http://www.tcl.tk:80/docs/ ;# matches
regexp $re(web) www.slashdot.org           ;# no match

# Match an email address
set re(email) {^([^@]+)@(.*\.(com|org|net))$}
regexp $re(email) someone@example.com      ;# matches
As you can see from these examples, regular expressions are both difficult to read and capable of performing quite complex matching. Let’s examine one of these patterns in more depth to see what is going on. If we take the Web URL pattern, we can break it apart step by step:
Firstly, the entire pattern is surrounded by a ^...$ pair of anchors. These ensure that the pattern only matches the entire string that is given, as ^ will only match at the start of the string, and $ will only match at the end. This is a common idiom which you will see in many regular expressions.

The first part of the pattern then matches against the protocol part of the URL: the initial http:// bit that tells us that this is a HyperText Transfer Protocol (HTTP) link. The pattern here is just a literal string, with the exception that we also allow secure HTTPS URLs using the optional pattern s?. Note that as this pattern follows the initial ^ anchor, it will only match as the very first characters of the input string.

The next part of a link is the host name of the server we wish to connect to, for instance wiki.tcl.tk. We could try to specify exactly which characters can appear in this part of the URL, but it is often easier to specify those which cannot: in this case, we know that the colon and forward-slash characters are forbidden in host names, and so we can use a negated pattern to match anything but these: [^:/]. As the host name is not optional, we want to match at least one such character, but with no upper limit on the size. Therefore we use the + modifier. Finally, we group this pattern into a sub-pattern to allow us to capture the host name when matching: ([^:/]+).

An HTTP URL can have an optional port number following the host name. This is simply a colon followed by a positive integer, which we can match as just :[0-9]+ (a colon followed by 1 or more digits). We again make this into a sub-pattern for capture and also make the entire sub-pattern optional: (:[0-9]+)?.

Finally, we also match the path of the requested document on the target server. This part is also optional, and we define it as simply anything else remaining at the end of the URL following an initial forward-slash: (/.*)?.
Putting everything together, we arrive at our final pattern:
^https?://([^:/]+)(:[0-9]+)?(/.*)?$
Beyond simple matching, REs are also capable of extracting information from a string while it is being matched. In particular, any sub-pattern in a RE surrounded by parentheses can be extracted into a variable by the regexp command:
regexp $re(web) https://example.com:8080/index.php match host port path
After this command, the match variable will contain the full string that was matched (i.e., https://example.com:8080/index.php), while the host, port, and path variables will contain example.com, :8080 and /index.php respectively (note that the colon is included in the port, as it sits inside the captured sub-pattern). The regexp command can also be used to count the number of occurrences of a given pattern, by passing the -all option as an initial argument:
puts "Number of words: [regexp -all {[^ ]+} $text]"
The regsub command allows substitutions to be performed on a string based on the matching of a regular expression pattern, either returning the modified string or saving it in a new variable. The substitution string can itself refer to elements of the matched RE pattern, by using one or more substitution escapes of the form \N, where N is a number between 0 and 9: \0 will be replaced with the string that matched the entire RE, \1 with the string that matched the first sub-pattern, and so on. You can also use the symbol & in place of \0. For instance, if you want to take a plain text file and convert it to HTML by making every instance of your own name highlighted in bold, you could achieve that with a single regsub command:
regsub -all {Neil Madden} $text {<b>\0</b>} html
Regular expressions provide a very powerful method of defining a pattern, but they are a bit awkward to understand and to use properly. In this section we will examine some more examples in detail to help the reader develop an understanding of how to use this technology to best effect. We start with a simple yet non-trivial example: finding floating-point numbers in a line of text. Do not worry: we will keep the problem simpler than it is in its full generality. We only consider numbers like 1.0 and not 1.00e+01.
How do we design our regular expression for this problem? By examining typical examples of the strings we want to match:
Examples of valid numbers are: 1.0, .02, +0., 1, +1, -0.0120.
Examples of invalid numbers (that is, strings we do not want to recognise as numbers but which superficially look like them): -, +, 0.0.1, 0..2, ++1.
Questionable numbers are: +0000 and 0001. We will accept them, because they normally are accepted and because excluding them makes our pattern more complicated.
A pattern is beginning to emerge:
A number can start with a sign (- or +) or with a digit. This can be captured with the pattern [-+]?, which matches a single -, a single +, or nothing.

A number can have zero or more digits in front of a single period (.) and it can have zero or more digits following the period. Perhaps [0-9]*\.[0-9]* will do ...

A number may not contain a period at all. So, revise the previous expression to: [0-9]*\.?[0-9]*
The complete expression is:
[-+]?[0-9]*\.?[0-9]*
At this point we can do three things:
Try the expression with a bunch of examples like the ones above and see if the proper ones match and the others do not.
Try to make the expression look nicer, before we start testing it. For instance, the class of characters [0-9] is so common that it has a shortcut: \d. So, we could settle for [-+]?\d*\.?\d* instead. Or we could decide that we want to capture the digits before and after the period for special processing: [-+]?(\d*)\.?(\d*)
Or (and this is a good strategy in general!), we can carefully examine the pattern before we start actually using it.
You may have noticed a problem with the pattern created above: all the parts are optional. That is, each part can match a null string: no sign, no digits before the period, no period, no digits after the period. In other words, our pattern can match an empty string! Our questionable numbers, like ‘+0000’, will be perfectly acceptable and we (grudgingly) agree. But, more surprisingly, the strings ‘--1’ and ‘A1B2’ will be accepted too! Why? Because the pattern can start anywhere in the string, so it would match the substrings ‘-1’ and ‘1’ respectively. We need to reconsider our pattern: it is too simple, too permissive.
The character before a minus or a plus, if there is any, cannot be another digit, a period or a minus or plus. Let us make it just a space or a tab or the beginning of the string: ^|[ \t]. This may look a bit strange, but what it says is: either the beginning of the string (^ outside the square brackets) or (the vertical bar) a space or tab (remember: the string \t represents the tab character).

Any sequence of digits before the period (if there is one) is allowed: \d+\.?

There may be zero digits in front of the period, but then there must be at least one behind it: \.\d+

And, of course, digits both in front of and behind the period: \d+\.\d+

The character after the number (if any) cannot be a +, a - or a ‘.’, as that would get us into the unacceptable number-like strings: $|[^+-.] (the dollar sign signifies the end of the string).
Before trying to write down the complete regular expression, let us see what different forms we have:
No period: [-+]?\d+
A period without digits before it: [-+]?\.\d+
Digits before a period, and possibly digits after it: [-+]?\d+\.\d*
Now the synthesis:
(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])
Let’s put it to the test:
set pattern {(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])}
set examples {
    "1.0"   " .02"  " +0."  "1"     "+1"    " -0.0120"
    "+0000" " - "   "+."    "0001"  "0..2"  "++1"
    "A1.0B" "A1"
}
foreach e $examples {
    if {[regexp $pattern $e whole _ number digits _]} {
        puts "PASS: >>$e<<: $number ($whole)"
    } else {
        puts "FAIL: >>$e<<: Does not contain a valid number"
    }
}
The result is:
PASS: >>1.0<<: 1.0 (1.0)
PASS: >> .02<<: .02 ( .02)
PASS: >> +0.<<: +0. ( +0.)
PASS: >>1<<: 1 (1)
PASS: >>+1<<: +1 (+1)
PASS: >> -0.0120<<: -0.0120 ( -0.0120)
PASS: >>+0000<<: +0000 (+0000)
FAIL: >> - <<: Does not contain a valid number
FAIL: >>+.<<: Does not contain a valid number
PASS: >>0001<<: 0001 (0001)
FAIL: >>0..2<<: Does not contain a valid number
FAIL: >>++1<<: Does not contain a valid number
FAIL: >>A1.0B<<: Does not contain a valid number
FAIL: >>A1<<: Does not contain a valid number
Our pattern correctly accepts the strings we intended to be recognised as numbers and rejects the others. See if you can adapt it to accept more valid forms of numbers, such as scientific notation (1.03e-2) or hexadecimal (0x12).
Here are a few other examples of uses for regular expressions:
Text enclosed in quotes: This is ‘quoted text’. If we know what character will be used for the quotes (e.g., a double-quote) then a simple pattern will do: "([^"]*)". If we do not know the enclosing character (it can be " or ') then we can use a so-called ‘back-reference’ to the first captured sub-string:

regexp {(["'])[^"']*\1} $string

The \1 reference matches whichever quote character was captured by the first sub-pattern.
You can use this technique to see if a word occurs twice in the same line of text6:
set string "Again and again and again ..." if {[regexp {(\y\w+\y).+\1} $string -> word]} { puts "The word $word occurs at least twice" }
Suppose you need to check the parentheses in some mathematical expression, such as (1 + a)/(1 - b× x). A simple check is to count the open and close parentheses:
if {[regexp -all {\(} $exp] != [regexp -all {\)} $exp]} {
    puts "Parentheses unbalanced!"
}
Of course, this is just a rough check. A better one is to see if at any point while scanning the string there are more close parentheses than open ones. We can easily extract the parentheses into a list and then check that (the -inline option helps here):
set parens  [regexp -all -inline {[()]} $exp]
set counts  [string map {( 1 ) -1} $parens]
set balance 0
foreach c $counts {
    if {[incr balance $c] < 0} {
        puts "Unbalanced!"
    }
}
if {$balance != 0} {
    puts "Unbalanced!"
}
We use a trick here by replacing the parentheses in the list with either +1 or -1. This allows us to quickly sum the list to check for balanced parentheses.
A number of tools are available to help experimenting with regular expressions, and a number of books cover the topic in depth, such as J. Friedl’s “Mastering Regular Expressions” [7].
1 For the technically minded, Tcl uses a form of reference-counting for memory management. This suffices for the majority of cases, and simplifies the C interface.
2 The reason for this discrepancy is largely historical, as the list command was already taken as a list constructor.
3 Actually, dictionaries preserve the order of elements as they are given, adding any new keys to the end.
4 This may be confusing to users from other languages where an ‘array’ is usually indexed by an integer rather than a string. Tcl’s arrays are really ‘associative arrays’ or hashtables.
5 It is also worth noting that Tcl’s regular expression engine, written by Henry Spencer, is of a different design to most other languages, with different performance characteristics. Patterns that are slow in some implementations may be faster in Tcl, and vice-versa.
6 This RE uses some advanced features: the \y anchor matches only at the beginning or end of a word, and \w matches any alpha-numeric character. The -> symbol is not a new bit of syntax but just a handy name for the variable that captures the entire match. This idiom is used frequently in Tcl code to indicate that the whole match was not required.