Like most programming languages, Tcl comes with facilities for handling many different sorts of data. We’ve already seen Tcl’s facilities for handling numbers and basic arithmetic. In this chapter we will examine all the other sorts of data that can be used in a Tcl program: strings, lists, dictionaries, and associative arrays, and how they can be put to work. We’ll also look at how you can use these basic data ‘types’ to build up more sophisticated data structures suited to your application.
Unlike most programming languages, Tcl does not have a built-in notion of ‘type’. More precisely, all values in Tcl are of a single type: strings. In practice, most programming with data in Tcl is much like programming in other languages: you decide on how some data is to be interpreted (e.g., as a number, or a list) and then use commands that are appropriate for that sort of data. Tcl will take care of making sure that efficient representations are used under-the-hood (e.g., integers are represented as native machine integers).
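As a short interactive sketch of this idea (the values here are purely illustrative):

% set x 0x2A          ;# looks like a hexadecimal integer
0x2A
% string length $x    ;# but it is still just a four-character string
4
% expr {$x + 1}       ;# and it can be treated as a number when needed
43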
Tcl’s approach to data structures is to provide a few general purpose structures and to then optimise the implementation for typical uses. This is in contrast to lower-level languages, such as C++ or Java, that provide lots of different data structures for different purposes. Tcl instead aims to simplify the task of programming by picking a few powerful and general data structures. In this regard, Tcl is similar to high-level applications like spreadsheets or relational database systems, that allow the user to specify what they want to achieve and leave it up to the implementation to ensure that most operations are efficient. Of course, you can always design your own custom data structures in Tcl if you need to, but the defaults provided are sufficient for most tasks. In addition, Tcl takes care of memory management and index checking for you, and so is safe from whole classes of errors due to memory mismanagement or ‘buffer overflows’.1
Tcl commands often have subcommands. We’ve already seen an example in the info command. The string command is another example; it contains subcommands for manipulating strings. While all values in Tcl are strings, the string command should be used only on data that you really want to be treated as a string, such as names of people or addresses. A command with subcommands is known as an “ensemble” command. Some useful string commands include:
string length "Hello, World!"
returns 13.
% string index "Hello, World!" 0
H
% string index "Hello, World!" 5
,
You can also index strings from the end using the following syntax:
% string index "Hello, World!" end
!
% string index "Hello, World!" end-2
l
% string range "Hello, World!" 2 end-2
llo, Worl
% string equal -nocase -length 5 "Hello, World!" "hElLo Tcl!"
1
% string compare "Hello" "World"
-1
% string compare -nocase apple Apple
0
% string first "ll" "Hello, World!"
2
You can also specify an optional start index for the search:
% string first "ll" "Hello, World!" 5
-1
% string last "ll" "Hello, World!"
2
% string last l "Hello, World!"
10
% set str "Hello, World!"
Hello, World!
% set idx 2
2
% string range $str [string wordstart $str $idx] \
    [string wordend $str $idx]-1
Hello
% string reverse "Hello, World!"
!dlroW ,olleH
% string repeat "Hello! " 5
Hello! Hello! Hello! Hello! Hello!
% string replace "Hello, World!" 0 4 Goodbye
Goodbye, World!
% string map {0 "zero " 1 "one " 2 "two "} 01201201202
zero one two zero one two zero one two zero two
% string trim "\tHello, World! "
Hello, World!
% string trimleft "Hello, World!" "lHe!"
o, World!
Note that the second argument is considered as a set of characters rather than a string.
""
) as valid
unless the -strict option is given. This command is mostly useful
in Tk GUIs (Part II).
A number of other commands are useful for manipulating strings, but are not in the string ensemble:
% set str "Hello,"
Hello,
% append str " World!"
Hello, World!
% set str
Hello, World!
Note that we do not use a dollar sign when passing the variable str to this command. This is because the command expects the name of a variable containing a string, rather than the string itself. The general rule is that if a command operates on a value then you pass it the value (using a dollar sign), but if it manipulates a variable then you pass the variable’s name without a dollar sign.
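A quick illustration of the difference (the values are just examples):

% set str "Hello"
Hello
% append str ", World!"   ;# append takes a variable name: no dollar sign
Hello, World!
% string length $str      ;# string length takes a value: use the dollar sign
13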
% concat "Hello," "World!"
Hello, World!
# Parse a simple colour specification of the form #RRGGBB in hex format
% scan #08D03F "#%2x%2x%2x" r g b
3
% puts "$r $g $b"
8 208 63
# Format the string back again
% format "#%02X%02X%02X" $r $g $b
#08D03F
These two commands are very useful for input validation and output formatting, and we will demonstrate their usage throughout the text. In particular, you should almost always use scan and format when processing numeric input and output.
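For instance, a minimal sketch of validating integer input with scan (the variable names here are purely illustrative):

set input "  42"
# scan returns the number of successful conversions, so 1 means we found an integer
if {[scan $input %d value] == 1} {
    puts "Got the integer $value"
} else {
    puts "\"$input\" does not begin with an integer"
}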
Tcl was one of the first programming languages to adopt Unicode-aware strings throughout the language. To this day, Tcl still has one of the most mature and well-integrated Unicode string implementations, roughly on a par with that of Java. Tcl strings are represented internally as a slightly modified version of UTF-8. Data entering or exiting the Tcl system is translated to and from UTF-8 largely transparently. (The details of how to control this conversion are discussed in Chapter 5). What this means for the Tcl programmer is that they can concentrate on the logic of their application, safe in the knowledge that they can handle input in most languages and character sets in use in the world. All standard Tcl commands that deal with strings (i.e., pretty much all of them!) can handle Tcl’s UTF-8 format, as can all commands that deal with input and output. Most Tcl extensions in common use are also usually Unicode-aware, so, for instance, Tcl XML parsers do actually handle the Unicode requirements of XML, unlike some languages we might care to mention!
There are some current limitations to Tcl’s otherwise excellent Unicode support. Firstly, Tcl currently only supports Unicode code-points up to U+FFFF (the ‘Basic Multilingual Plane’, or BMP). Support for Unicode characters beyond this range is a work-in-progress. The remaining limitations are largely within the Tk graphical user interface extension: for instance, support for rendering bidirectional (‘bidi’) scripts is currently missing. Overall, though, Tcl’s support for Unicode and different character encodings is first class.
To enter arbitrary Unicode characters into a Tcl script, you can use the Unicode escape syntax described in Section 1.4.1:
set str "n\u00E4mlich"
This will result in $str containing the string nämlich. To manually convert between different encodings you can use the encoding command. Note: you should normally never have to use the encoding command directly, as all conversions are usually done by the I/O system automatically.
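Should you need it, a minimal sketch might look like the following (converting to and from ISO 8859-1, and the variable names bytes and str2, are just illustrative choices):

# Convert the Tcl string into a sequence of bytes in ISO 8859-1 (Latin-1)
set bytes [encoding convertto iso8859-1 $str]
# ... and convert those bytes back into a Tcl string
set str2 [encoding convertfrom iso8859-1 $bytes]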
One particularly useful string operation is the string match command that takes a pattern, known as a ‘glob’, and determines if a given string matches the pattern. ‘Globbing’ is the wild-card matching technique that most Unix shells use. In addition to string match, globbing is also used by the switch, lsearch and glob commands, discussed later. The wildcards that string match accepts are: * (which matches any sequence of zero or more characters), ? (which matches any single character), [chars] (which matches any single character in the given set or range) and \x (which matches the character x literally, even if it is a wildcard).
Some examples are as follows:
% string match f* foo
1
% string match f?? foo
1
% string match f foo
0
% string match {[a-z]*} foo
1
% string match {[a-z]*} Foo
0
The string match command is a great way of determining if some input string matches a particular format. If you have more complex pattern matching requirements, or need to simultaneously extract information as well as match it, then regular expressions (Section 2.6) provide a more powerful (but harder to use) facility.
Tcl has sophisticated facilities for dealing with dates and times in the form of the clock command, another ensemble. Tcl represents dates and times as the number of seconds since the epoch time of 1st January, 1970, 00:00 UTC. Note that Tcl’s clock times do not include leap seconds: each UTC day is considered to have exactly 86400 seconds. Instead, Tcl minutely adjusts its clock speed to account for leap seconds. The main clock-related commands are clock seconds, clock milliseconds and clock microseconds (which return the current time at second, millisecond and microsecond resolution), clock format (which converts a time in seconds into a human-readable string), clock scan (which parses a date/time string into a time in seconds), clock add (which performs date arithmetic) and clock clicks (a high-resolution timer, described below).
The format and scan sub-commands accept a -format option that specifies the expected format of the output/input. If not specified, clock format defaults to a format of ‘%a %b %d %H:%M:%S %z %Y’, while clock scan uses ‘free form scan’ mode, in which it attempts to guess the format of the given input string. The possible format specifiers are too numerous to list in this tutorial. Instead, we refer the reader to the official manual page [2]. The -locale and -timezone options can be used to specify the locale and timezone in which time values are processed. The -base option to clock scan specifies a base time that relative values are computed from. In older versions of Tcl you can use the -gmt 1 option to specify that all processing should be done in UTC. Since Tcl 8.5, this usage is deprecated in favour of -timezone :UTC. Some examples of the use of the clock command:
% clock format [clock seconds] -format "%Y-%m-%d %T"
2009-03-22 20:50:42
% clock format [clock scan "now + 1 year"]
Mon Mar 22 00:00:00 GMT 2010
The clock clicks command returns a system-dependent high-resolution timer. This timer is not guaranteed to be relative to any particular epoch, unlike the other clock commands, and so should only be used for relative timing when the highest-resolution timer that a system supports is needed (such as for benchmarks).
If you want to time how long a particular piece of Tcl code is taking to run (for profiling), then you can use the time command that is designed for just this purpose:
time script ?iterations?
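For example, a quick sketch of timing a snippet over many iterations (the reported figure is purely illustrative and will vary from machine to machine):

% time {string repeat "Hello, World! " 1000} 100
7.2 microseconds per iteration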
The list is a fundamental data structure in Tcl. A list is simply an ordered collection of elements: numbers, words, strings, or other lists. As Tcl is untyped, lists can contain elements of multiple different kinds. Even commands in Tcl are just lists, in which the first element is the name of the command to call, and subsequent elements are the arguments to that command. Tcl lists are implemented in a similar manner to arrays or vectors in languages such as C. That is, Tcl list elements are held in contiguous ranges of memory. This makes indexing and iterating over elements very efficient (O(1) indexing in ‘big-Oh notation’ [13]). Other operations, such as appending and deleting elements are also efficient for most usage patterns due to the way Tcl handles list memory.
Lists can be created in several ways:
set myList {"Item 1" "Item 2" "Item 3"}
Note that in this tutorial we often use braces to delimit lists and quotes to delimit other strings. This is just a convention: Tcl will accept either form for either a string or a list, so long as it is in the right format.
set myList [list "Item 1" "Item 2" "Item 3"]
set myList [split "Item 1,Item 2,Item 3" ","]
As for strings, there are a number of commands for accessing and manipulating elements of lists. Unlike string, these list commands are not part of a single ensemble, but are separate commands using an ‘l’ prefix, e.g., lindex, lrange etc.:2
% llength {"Item 1" "Item 2" "Item 3"}
3
% lindex {a b c d} 2
c
As for strings, you can also use the ‘end-n’ syntax to access elements from the end.
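For instance:

% lindex {a b c d} end
d
% lindex {a b c d} end-1
c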
% lrange {"Item 1" "Item 2" "Item 3"} 1 2
{Item 2} {Item 3}
Note that Tcl leaves off the outer braces when displaying lists, and prefers to use braces rather than quotes to delimit elements. Most Tcl list commands will ‘normalize’ lists in this way. Don’t be fooled though: Tcl always ensures that the elements of lists can be retrieved in exactly the format that they were created. In other words, the following law holds for all list operations:
[lindex [list $x] 0] == $x
You should always use the standard Tcl list commands for creating and dismantling lists to ensure that this law continues to hold. Don’t be tempted to use string commands to try and create strings that ‘look like lists’. While this can work, the details are much trickier than they appear at first. All Tcl’s list commands ensure that the lists they produce are always properly formed.
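As a brief sketch of why this matters, consider an element that happens to contain a space:

% set x "Hello, World!"
Hello, World!
% set good [list $x]    ;# list adds the braces needed to keep x as one element
{Hello, World!}
% lindex $good 0
Hello, World!
% lindex $x 0           ;# treating the raw string as a list splits at the space
Hello,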
% linsert {b c d} 0 a
a b c d
% linsert {a b c} end d
a b c d
% lreplace {a b c d} 1 2 "Bee" "Cee"
a Bee Cee d
% lreplace {a b c d} 1 2
a d
% lreverse {a b c d}
d c b a
% lrepeat 3 a b c
a b c a b c a b c
% join {a b c d} ", "
a, b, c, d
Note that join and split aren’t complete inverses. In particular, it is not guaranteed that joining a list by a certain delimiter and then splitting that list on the same delimiter will result in the original list. Consider this example:
% set x [list a b "c,d" e]
a b c,d e
% set xstr [join $x ,]
a,b,c,d,e
% split $xstr ,
a b c d e
% set xs {a b c d}
a b c d
% foreach x $xs { puts $x }
a
b
c
d
A number of commands operate on variables containing lists, rather than directly on lists, for reasons of efficiency and convenience, much like the append command for strings. These commands also follow the l-prefix convention:
% set xs [list a b]
a b
% lappend xs c d
a b c d
% set xs
a b c d
% set xs [list a b c d]
a b c d
% lset xs 1 Bee
a Bee c d
% set xs [list a b c d]
a b c d
% lassign $xs first second
c d
% puts "$first, $second"
a, b
As well as string and number elements, Tcl lists can also contain other lists as elements. Some Tcl commands, such as lindex, can automatically extract elements from nested lists in a single operation. For instance, we might represent a person as a list of three elements: a name, an address, and a date of birth. The address field is itself a list of strings. For example:
% set jon [list "Jon Doe" [list "A House" "Somewhere"] "1-Jan-1970"]
{Jon Doe} {{A House} Somewhere} 1-Jan-1970
We can then retrieve the first line of Jon’s address using lindex and specifying multiple indices. The indices act like a ‘path’ through the nested list: the first index specifies where to look in the outer list, then the next index where to look in that list, and so on. In this case, the address is the second element of the jon list (i.e., element at index 1), and the first line is then element 0 of the address list, so the complete path is 1 0:
% lindex $jon 1 0
A House
The lset command also recognises nested lists, and can be used to alter variables containing them. For instance, if we wanted to change Jon’s house name, we can use:
% lset jon 1 0 "Jon's House"
{Jon Doe} {{Jon's House} Somewhere} 1-Jan-1970
% lindex $jon 1 0
Jon's House
Tcl builds in powerful facilities for sorting and searching lists of elements, in the form of the lsort and lsearch commands:
lsort ?-option value ...? list
lsearch ?-option value ...? list pattern
% lsort {b c e g f a d}
a b c d e f g
The comparison function can be changed by specifying one of the following options: -ascii (the default, comparing elements as strings), -dictionary (like -ascii, but case-insensitive and comparing embedded numbers numerically), -integer, -real, or -command cmd (which uses a custom comparison command).
Tip: The -command option is very slow, as it has to call a Tcl command multiple times during the sort. It is often much faster to massage the list into a format where it can be sorted using one of the built-in comparison functions, typically by creating a collation key. The wiki article “Custom sorting” [3] provides a general solution to this problem, along with much useful discussion of performance optimisations.
% set people {{Jon 32} {Mary 24} {Mike 31} {Jill 20}}
{Jon 32} {Mary 24} {Mike 31} {Jill 20}
% lsort -integer -index 1 $people
{Jill 20} {Mary 24} {Mike 31} {Jon 32}
% set people {Jon 32 Mary 24 Mike 31 Jill 20}
Jon 32 Mary 24 Mike 31 Jill 20
% lsort -integer -index 1 -stride 2 $people
Jill 20 Mary 24 Mike 31 Jon 32
Note that the -unique option keeps only the last of any set of elements that compare as equal: for example, if both {1 a} and {1 b} are present, in that order, and elements are compared on their first field only, then only {1 b} will be present in the output.
As an example, we can retrieve a list of all commands defined in a Tcl interpreter, sort them, and then print them out, one per line, using the following ‘one-liner’:
puts [join [lsort [info commands]] \n]
As well as sorting lists, you can also search for elements within them using the lsearch command. Like lsort, this command comes with a somewhat bewildering array of options to control its operation. However, its basic usage is very straight-forward: given a list and a pattern to search for, it will return the index of the first element that matches the pattern, or -1 if it cannot be found. By default this uses ‘glob’-style matching, as described in Section 2.1.2:
% lsearch {foo bar jim} b*
1
You can control the way that matching is performed using further options: -exact, -glob (the default) and -regexp select the matching style, while options such as -nocase, -not, -all and -inline change how matches are found and reported:
% lsearch -inline {foo bar jim} b*
bar
The lsearch command also supports options for specifying the contents of each element (as for lsort), e.g., -ascii, -integer etc., as well as specifying the sort order of sorted lists, finding the nearest element to a given pattern (if it is not exactly matched), or searching within nested lists.
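For instance, combining -all with -inline returns every matching element rather than just the index of the first (a small sketch):

% lsearch -all {foo bar baz jim} b*
1 2
% lsearch -all -inline {foo bar baz jim} b*
bar baz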
Dictionaries are available since Tcl version 8.5.
While a list is an ordered collection of elements, a dictionary is an unordered3 mapping from keys to values. Other languages might refer to such data structures as maps or (hash-)tables. Rather than each element having a position and an integer index, as in a list, each element in a dictionary is given a string name. The elements of a Tcl dictionary can be any string value, allowing them to also be used in a similar manner to records or structures in typed languages. The format of a Tcl dictionary is that of a list with an even number of elements: the even-indexed elements (0, 2, 4, ...) correspond to the keys, and the odd-indexed elements to the values for those keys. For instance, we can represent a person in the following format:
set jon {
    name    "Jon Doe"
    address {{A House} {Somewhere}}
    dob     1-Jan-1970
}
Viewed as a list, the $jon variable contains 6 elements. Viewed as a dictionary, it contains 3 mappings for the keys name, address, and dob. As well as creating dictionaries using a literal string, you can also use the dict create constructor command, that works like the list command, but creates a dictionary. We can retrieve the value associated with a key using the dict get command:
% dict get $jon dob
1-Jan-1970
You can check whether a dictionary contains a mapping for a given key using the dict exists command:
% dict exists $jon dob
1
% dict exists $jon some-other-key
0
Both of these commands can take a series of keys (as separate arguments), allowing you to access elements in nested dictionaries, in a similar manner to that for nested lists. Some other useful dictionary commands are dict set and dict unset (which update a dictionary stored in a variable), dict replace and dict remove (which return a modified copy of a dictionary value), dict merge (which combines several dictionaries), dict keys, dict values and dict size, and dict for (which iterates over the key-value pairs):
% set jane [dict replace $jon name "Jane Doe" gender female]
name {Jane Doe} address {{A House} {Somewhere}} dob 1-Jan-1970 gender female
# Change Jon's date-of-birth
dict set jon dob 2-Jan-1970
set default {
    font-weight: normal
    font-shape:  roman
}
set user {
    font-family: sans-serif
}
set author {
    font-family: serif
    font-weight: bold
}
# Merge the styles, preferring author over user over defaults
set style [dict merge $default $user $author]
# Extract final values from the combined dictionary
set family [dict get $style font-family:]
set weight [dict get $style font-weight:]
set shape  [dict get $style font-shape:]
puts "Paragraph font: $family $shape $weight"
dict for {key value} $jon {
    puts "$key = $value"
}
In addition to these commands, the dict command also includes a number of convenience commands for manipulating common data in-place within a dictionary. For example, the dict incr command allows efficient in-place incrementing of an integer value stored within a dictionary. For example, we can print a summary of word counts within a text using the following code (ignoring punctuation for now):
set word_counts [dict create]
foreach word [split $text] {
    dict incr word_counts $word
}
# Display the results
dict for {word count} $word_counts {
    puts [format "%-30s : %d" $word $count]
}
You can sort a dictionary using lsort, but not with the -dictionary option! (The name clash is accidental). Instead, you can use the -stride and -index options, taking advantage of the fact that all dictionaries are also lists:
set sorted_words [lsort -integer -stride 2 -index 1 $word_counts]
The dict command comes with powerful functionality for filtering a dictionary value to select just those key-value pairs that match some criteria. The dict filter command supports three different forms of filtering:
key and value filtering returns those key-value pairs whose key or value (respectively) matches one of a set of glob patterns, similar to lsearch.
script filtering allows a script to be run on each key-value pair of the dictionary, and includes only those elements for which the script returns a true value.
For example, given our dictionary of word counts from the previous section, we can return the counts of all words beginning with ‘a’ using:
set a_words [dict filter $word_counts key a*]
Or we could return all words with a count in double figures:
set d_words [dict filter $word_counts value ??]
Finally, we could return all words with a count greater than 15 using:
set frequent [dict filter $word_counts script {key value} {
    expr {$value > 15}
}]
The dict command also includes a very convenient feature for updating the contents of a dictionary variable in-place using an arbitrary script. The dict update and dict with commands unpack a dictionary value into a set of local variables with the names and values given by the contents of the dictionary. A script can then be executed that manipulates these variables. At the end of the script, any changes are read back from the variables and the corresponding changes are made to the original dictionary. For example, if we consider our ‘jon’ dictionary from the start of Section 2.4:
set jon {
    name    "Jon Doe"
    address {{A House} {Somewhere}}
    dob     1-Jan-1970
}
We can use the dict with command to unpack this data and manipulate it more concisely than by using individual dict get and dict set commands:
dict with jon {
    puts "Name: $name"
    puts "Addr: [join $address {, }]"
    puts "DOB : $dob"
    # Change Jon's name
    set name "Other Jon"
}
puts "Jon's name is now: [dict get $jon name]"
The dict update command works in a similar fashion, except that it allows you to specify exactly which entries should become variables and also to specify what those variable names should be (rather than just using the key name). This approach can be used to make batch manipulations to a dictionary value, using the full range of Tcl commands for manipulating ordinary variables.
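A minimal sketch of dict update, unpacking two keys into local variables of our own choosing (the names n and d are arbitrary):

dict update jon name n dob d {
    puts "Name: $n, DOB: $d"
    set d 2-Jan-1970    ;# written back into the dictionary when the script ends
}
puts "New DOB: [dict get $jon dob]"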
Tcl’s arrays are similar in some respects to dictionaries: they are unordered mappings from string keys to some value.4 In contrast to dictionaries, however, arrays map from string keys to variables rather than values. This means that arrays are not themselves first-class values in Tcl and cannot be passed to and from procedures without some extra work. On the other hand, arrays provide a convenient syntax and can be updated in-place just like other variables. It is an error to have an array variable and a normal variable with the same name.
Array variables can be created and manipulated just like regular variables, but using the special array syntax, which consists of a normal variable name followed by a string key in parentheses. For example, to create an array variable named ‘jon’ containing name, address, and date-of-birth fields, we would write:
set jon(name)    "Jon Doe"
set jon(address) {"A House" "Somewhere"}
set jon(dob)     1-Jan-1970
The array set command can also be used to create an array in one go:
array set jon {
    name    "Jon Doe"
    address {"A House" Somewhere}
    dob     1-Jan-1970
}
Once an array has been created, we can use the array syntax to manipulate its elements using normal variable commands. An array entry can be used pretty much everywhere that a normal variable can be:
puts "Name: $jon(name)" set jon(name) "Other Jon" lappend jon(address) "Planet Earth"
The parray command can be used to display the contents of an array variable:
% parray jon
jon(address) = {A House} Somewhere {Planet Earth}
jon(dob)     = 1-Jan-1970
jon(name)    = Other Jon
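Because an array is not itself a value, the usual way to move its contents around (for example, to pass them to a procedure) is the array get and array set pair, which convert between an array and a flat key-value list. A small sketch (the variable names pairs and copy are just illustrative):

# Flatten the array into a key-value list (which is also usable as a dictionary)
set pairs [array get jon]
# Rebuild a fresh array from that list
array set copy $pairs
puts $copy(name)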
Given the apparent similarity between arrays and dictionaries, it may not seem obvious when you should use one or the other. In fact, the two are quite different things and are useful in different situations. A dictionary should be used when your primary objective is to create a data structure which is naturally modelled as a collection of named fields. Using a dictionary gives you all the advantages of a normal Tcl value: it can be passed to and from procedures and other commands, it can be sent over communication channels (covered in Chapter 5), and it can be easily inserted into other data structures. Tcl also automatically manages the memory for dictionaries, as for other values, so you do not need to worry about freeing up resources when you have finished with it. An array, on the other hand, is a complex stateful entity, and is best used when you want to model a long-lived stateful component in your application. For instance, it can be very useful to associate an array with a graphical user interface component, to hold current information on the state of the display and user interactions. Such information typically lasts a long time and is constantly changing in response to user inputs. A dictionary could also be used in this situation, but an array as a collection of variables brings advantages such as variable traces that really shine in these kinds of situations.
Regular expressions are a compact means of expressing complex patterns for matching strings and extracting information. If you are not already familiar with regular expressions from other languages, such as Perl, then they can be quite daunting at first, with their terse and cryptic syntax. However, once you have mastered the basics they soon become an indispensable tool that you will wonder how you ever managed without! This tutorial will give only a very brief introduction to the basic features of regular expressions and the support provided for them in Tcl. There are many tutorials and good books devoted to the subject available on the Web and in technical bookstores if you wish to learn more. Readers who have come across regular expressions in other languages will find that Tcl uses the familiar POSIX-style syntax for basic regular expressions. Note that Tcl does not use the popular Perl-Compatible Regular Expression (PCRE) syntax used in several modern programming languages, so some advanced features may be slightly different.5
Before getting into the technicalities of regular expressions (REs) in Tcl, it is worth pointing out a little of the theoretical origins and practical limitations. A regular expression is so-called because it is capable of matching a regular language. Such languages are relatively simple in the grand scheme of things: no popular computer programming language has a grammar that simple, for instance. As an example, no regular expression is capable of matching a string only if the braces within it are balanced (i.e., each open brace is paired with exactly one close brace). It is therefore good to be aware of the limits of regular expressions, and when a more appropriate technology should be used. A classic example is extracting information from XML or HTML documents: at first it can look as if a simple RE will do the trick, but it usually then becomes apparent that XML is more complicated than it looks. Soon you end up with a large and unwieldy RE that still doesn’t match every example. The solution is to use a proper XML parser. However, this is not to say that regular expressions are not a powerful tool. Firstly, almost all regular expression engines—and Tcl is no exception here—implement more than the basic regular expressions (BREs) of theory. Various extensions to REs allow writing quite complex patterns quite succinctly, making them a handy tool to have in your toolbox. Even when your problem requires more than a RE can provide, regular expressions can still help when building a more comprehensive parser. In terms of parsing theory, regular expressions are most useful for lexical analysis while more sophisticated tools (such as recursive descent parsers) are more suited to syntactic analysis.
Tcl supports regular expressions primarily through two commands:
regexp ?options? RE string ?matchVar ...?
regsub ?options? RE string subSpec ?varName?
Table 3: Basic regular expression syntax.
Regular expressions are similar to the glob-matching that was discussed in Section 2.1.2. The main difference is the way that sets of matched characters are handled. In globbing the only way to select sets of unknown text is the * symbol, which matches any quantity of any character. Regular expression matching is much more precise: instead of such ‘wild-card’ matching, you can say exactly which characters you wish to match and then apply a modifier to this pattern to say how many times you wish to match it. For example, the RE a* will match 0, 1 or any number of a characters in a sequence. Modifiers can also be applied to sub-patterns, not just individual characters, so that [abc]* will match strings of any length containing just the characters a, b, or c (e.g., ‘bbacbaababc’). Another way of writing this pattern would be to use a character range: [a-c]*, which again matches strings of any length (including empty strings) consisting of just characters between a and c in the Unicode alphabet. You can also create negated patterns which match anything that isn’t in a certain set of characters, for example: [^a-c] will match any character that isn’t a, b, or c. As elsewhere in Tcl, a backslash character can be used to escape any of this special syntax to include literal asterisks and so on in a pattern. For instance, to match any number of dots, we can use \.*. Here are some examples of regular expressions:
# Match an Internet Protocol (IPv4) address in dotted form:
set re(ip) {([0-9]{1,3}\.){3}[0-9]{1,3}}
# Example:
regexp $re(ip) 192.168.0.1                 ;# returns 1 (i.e., matched)

# Match a Web URL
set re(web) {^https?://([^:/]+)(:[0-9]+)?(/.*)?$}
regexp $re(web) http://www.tcl.tk:80/docs/ ;# matches
regexp $re(web) www.slashdot.org           ;# no match

# Match an email address
set re(email) {^([^@]+)@(.*\.(com|org|net))$}
regexp $re(email) someone@example.com      ;# matches
As you can see from these examples, regular expressions are both difficult to read and capable of performing quite complex matching. Let’s examine one of these patterns in more depth to see what is going on. If we take the Web URL pattern, we can break it apart step by step:
Firstly, the entire pattern is surrounded by a ^...$ pair of anchors. These ensure that the pattern only matches the entire string that is given, as ^ will only match at the start of the string, and $ will only match at the end. This is a common idiom which you will see in many regular expressions.

The first part of the pattern then matches against the protocol part of the URL: the initial http:// bit that tells us that this is a HyperText Transfer Protocol (HTTP) link. The pattern here is just a literal string, with the exception that we also allow secure HTTPS URLs using the optional pattern s?. Note that as this pattern follows the initial ^ anchor, it will only match as the very first characters of the input string.

The next part of a link is the host name of the server we wish to connect to, for instance wiki.tcl.tk. We could try to specify exactly which characters can appear in this part of the URL, but it is often easier to specify those which cannot: in this case, we know that the colon and forward-slash characters are forbidden in host names, and so we can use a negated pattern to match anything but these: [^:/]. As the host name is not optional, we want to match at least one such character, but with no upper limit on the size. Therefore we use the + modifier. Finally, we group this pattern into a sub-pattern to allow us to capture the host name when matching: ([^:/]+).

An HTTP URL can have an optional port number following the host name. This is simply a colon followed by a positive integer, which we can match as just :[0-9]+ (a colon followed by 1 or more digits). We again make this into a sub-pattern for capture and also make the entire sub-pattern optional: (:[0-9]+)?.

Finally, we also match the path of the requested document on the target server. This part is also optional, and we define it as simply anything else remaining at the end of the URL following an initial forward-slash: (/.*)?.
Putting everything together, we arrive at our final pattern:
^https?://([^:/]+)(:[0-9]+)?(/.*)?$
Beyond simple matching, REs are also capable of extracting information from a string while it is being matched. In particular, any sub-pattern in a RE surrounded by parentheses can be extracted into a variable by the regexp command:
regexp $re(web) https://example.com:8080/index.php match host port path
After this command, the match variable will contain the full string that was matched (i.e., https://example.com:8080/index.php), while the host, port, and path variables will contain example.com, :8080 and /index.php respectively (note that the colon is included in the port, as it sits inside the captured sub-pattern). The regexp command can also be used to count the number of occurrences of a given pattern, by passing the -all option as an initial argument:
puts "Number of words: [regexp -all {[^ ]+} $text]"
The regsub command allows substitutions to be performed on a string based on the matching of a regular expression pattern, either returning the modified string or saving it in a new variable. The substitution string can itself refer to elements of the matched RE pattern, by using one or more substitution escapes of the form \N, where N is a number between 0 and 9: \0 will be replaced with the string that matched the entire RE, \1 with the string that matched the first sub-pattern, and so on. You can also use the symbol & in place of \0. For instance, if you want to take a plain text file and convert it to HTML by making every instance of your own name highlighted in bold, you could achieve that with a single regsub command:
regsub -all {Neil Madden} $text {<b>\0</b>} html
Regular expressions provide a very powerful method of defining a pattern, but they are a bit awkward to understand and to use properly. In this section we will examine some more examples in detail to help the reader develop an understanding of how to use this technology to best effect. We start with a simple yet non-trivial example: finding floating-point numbers in a line of text. Do not worry: we will keep the problem simpler than it is in its full generality. We only consider numbers like 1.0 and not 1.00e+01.
How do we design our regular expression for this problem? By examining typical examples of the strings we want to match:
Examples of valid numbers are: 1.0, .02, +0., 1, +1, -0.0120.
Examples of invalid numbers (that is, strings we do not want to recognise as numbers but which superficially look like them): -, +, 0.0.1, 0..2, ++1.
Questionable numbers are: +0000 and 0001. We will accept them, because they normally are accepted and because excluding them makes our pattern more complicated.
A pattern is beginning to emerge:
A number can start with a sign (- or +) or with a digit. This can be captured with the pattern [-+]?, which matches a single -, a single +, or nothing.

A number can have zero or more digits in front of a single period (.) and it can have zero or more digits following the period. Perhaps [0-9]*\.[0-9]* will do ...

A number may not contain a period at all. So, revise the previous expression to: [0-9]*\.?[0-9]*
The complete expression is:
[-+]?[0-9]*\.?[0-9]*
At this point we can do three things:
Try the expression with a bunch of examples like the ones above and see if the proper ones match and the others do not.
Try to make the expression look nicer, before we start testing it. For instance, the class of characters [0-9] is so common that it has a shortcut: \d. So, we could settle for [-+]?\d*\.?\d* instead. Or we could decide that we want to capture the digits before and after the period for special processing: [-+]?(\d*)\.?(\d*)
Or (and this is a good strategy in general!), we can carefully examine the pattern before we start actually using it.
You may have noticed a problem with the pattern created above: all the parts are optional. That is, each part can match a null string: no sign, no digits before the period, no period, no digits after the period. In other words, our pattern can match an empty string! Our questionable numbers, like ‘+0000’, will be perfectly acceptable and we (grudgingly) agree. But, more surprisingly, the strings ‘--1’ and ‘A1B2’ will be accepted too! Why? Because the pattern can start anywhere in the string, so it would match the substrings ‘-1’ and ‘1’ respectively. We need to reconsider our pattern: it is too simple, too permissive.
The character before a minus or a plus, if there is any, cannot be another digit, a period or a minus or plus. Let us make it just a space or a tab or the beginning of the string: ^|[ \t]. This may look a bit strange, but what it says is: either the beginning of the string (^ outside the square brackets) or (the vertical bar) a space or tab (remember: the string \t represents the tab character).

Any sequence of digits before the period (if there is one) is allowed: \d+\.?

There may be zero digits in front of the period, but then there must be at least one behind it: \.\d+

And, of course, digits both in front of and behind the period: \d+\.\d+

The character after the number (if any) cannot be a +, a - or a ‘.’, as that would get us into the unacceptable number-like strings: $|[^+-.] (the dollar sign signifies the end of the string).
Before trying to write down the complete regular expression, let us see what different forms we have:
No period: [-+]?\d+
A period without digits before it: [-+]?\.\d+
Digits before a period, and possibly digits after it: [-+]?\d+\.\d*
Now the synthesis:
(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])
Let’s put it to the test:
set pattern {(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])}
set examples {
    "1.0"   " .02"  " +0."  "1"     "+1"    " -0.0120"
    "+0000" " - "   "+."    "0001"  "0..2"  "++1"
    "A1.0B" "A1"
}
foreach e $examples {
    if {[regexp $pattern $e whole _ number digits _]} {
        puts "PASS: >>$e<<: $number ($whole)"
    } else {
        puts "FAIL: >>$e<<: Does not contain a valid number"
    }
}
The result is:
PASS: >>1.0<<: 1.0 (1.0)
PASS: >> .02<<: .02 ( .02)
PASS: >> +0.<<: +0. ( +0.)
PASS: >>1<<: 1 (1)
PASS: >>+1<<: +1 (+1)
PASS: >> -0.0120<<: -0.0120 ( -0.0120)
PASS: >>+0000<<: +0000 (+0000)
FAIL: >> - <<: Does not contain a valid number
FAIL: >>+.<<: Does not contain a valid number
PASS: >>0001<<: 0001 (0001)
FAIL: >>0..2<<: Does not contain a valid number
FAIL: >>++1<<: Does not contain a valid number
FAIL: >>A1.0B<<: Does not contain a valid number
FAIL: >>A1<<: Does not contain a valid number
Our pattern correctly accepts the strings we intended to be recognised as numbers and rejects the others. See if you can adapt it to accept more valid forms of numbers, such as scientific notation (1.03e-2) or hexadecimal (0x12).
Here are a few other examples of uses for regular expressions:
Text enclosed in quotes: This is ‘quoted text’. If we know what character will be used for the quotes (e.g., a double-quote) then a simple pattern will do: "([^"]*)". If we do not know the enclosing character (it can be " or ') then we can use a so-called ‘back-reference’ to the first captured sub-string:

regexp {(["'])[^"']*\1} $string

The \1 reference matches whichever quote character was captured by the first sub-pattern.
You can use this technique to see if a word occurs twice in the same line of text6:
set string "Again and again and again ..." if {[regexp {(\y\w+\y).+\1} $string -> word]} { puts "The word $word occurs at least twice" }
Suppose you need to check the parentheses in some mathematical expression, such as (1 + a)/(1 - b× x). A simple check is to count the open and close parentheses:
if {[regexp -all {\(} $exp] != [regexp -all {\)} $exp]} {
    puts "Parentheses unbalanced!"
}
Of course, this is just a rough check. A better one is to see if at any point while scanning the string there are more close parentheses than open ones. We can easily extract the parentheses into a list and then check that (the -inline option helps here):
set parens  [regexp -all -inline {[()]} $exp]
set counts  [string map {( 1 ) -1} $parens]
set balance 0
foreach c $counts {
    if {[incr balance $c] < 0} {
        puts "Unbalanced!"
    }
}
if {$balance != 0} {
    puts "Unbalanced!"
}
We use a trick here by replacing the parentheses in the list with either +1 or -1. This allows us to quickly sum the list to check for balanced parentheses.
A number of tools are available to help experimenting with regular expressions, and a number of books cover the topic in depth, such as J. Friedl’s “Mastering Regular Expressions” [7].
1 For the technically minded, Tcl uses a form of reference-counting for memory management. This suffices for the majority of cases, and simplifies the C interface.
2 The reason for this discrepancy is largely historical, as the list command was already taken as a list constructor.
3 Actually, dictionaries preserve the order of elements as they are given, adding any new keys to the end.
4 This may be confusing to users from other languages where an ‘array’ is usually indexed by an integer rather than a string. Tcl’s arrays are really ‘associative arrays’ or hashtables.
5 It is also worth noting that Tcl’s regular expression engine, written by Henry Spencer, is of a different design to most other languages, with different performance characteristics. Patterns that are slow in some implementations may be faster in Tcl, and vice-versa.
6 This RE uses some advanced features: the \y anchor matches only at the beginning or end of a word, and \w matches any alpha-numeric character. The -> symbol is not a new bit of syntax but just a handy name for the variable that captures the entire match. This idiom is used frequently in Tcl code to indicate that the whole match was not required.