Forth Lesson 8

Mitch Bradley's Forth and Open Firmware Lessons

Review

The conditional control structure "if ... [ else ... ] then"
Various conditional looping structures like "begin ... until"
The comparison operators
"do ... loop" and its variations and wrinkles
That the compiler has an extension mechanism
How control structures are implemented using that mechanism

Strings

As in C, strings are not a first-class data type in Forth. Forth has no support for automatic string allocation and garbage collection. This is in keeping with Forth's roots in real-time control on resource-limited systems. But as with C, you can do what you need to do, even if it isn't especially convenient.

Displaying Strings

 ok : hello  ." Hello, world"  cr  ;
 ok hello
 Hello, world

Finally, we have reached the Forth version of the canonical first program! ." (dot-quote) displays a string delimited by the next '"' .

We didn't really have to make a definition:

 ok ." Hello, world" cr
 Hello, world

Technically, in standard Forth, you are supposed to use .( instead of ." outside of a definition (for tedious reasons), as in:

 ok .( Hello, world) cr
 Hello, world

but Open Firmware lets you use either form.

"cr" sends a newline sequence to the output device. Contrast this to the C practice of embedding the newline as an escape sequence inside the string and letting the output device driver transform it into the appropriate sequence. Both approaches have advantages and disadvantages...

Note that, as with everything else, ." is just a Forth word, and thus must be whitespace-delimited. Its behavior is to parse a string delimited by " from the input stream and display it. ." is one of those "immediate" words that we talked about in the last lesson. When ." is encountered in compilation state, it executes immediately, parses out the string at compile time, then stores that string in the new definition as a literal, along with some code to display it when the new definition later executes.
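As a small illustration of both points (the word name "greet" is made up for this sketch): nothing is displayed while the definition is being compiled, and the single space after ." is only the delimiter, so any further spaces belong to the string.

 ok : greet  ." Compiled earlier, shown now"  cr  ;   \ greet is a hypothetical name
 ok greet
 Compiled earlier, shown now
 ok ." x" ."  y" cr
 x y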

Literal Strings

If you just want a literal string for use later, i.e. you don't want to display it, use " (quote) instead of ." (dot-quote).

 ok : my-string  ( -- adr len )  " this is a test"  ;
 ok my-string type
 this is a test

The stack representation of a string is ( adr len ): the address of the first character and the number of characters. Characters are just bytes. Only ASCII support is guaranteed, although Open Firmware typically has an ISO-Latin-1 font. The length includes just the characters in the string. Forth has no explicit notion of a null-terminated string as in C, so you could put binary data, including 0 bytes, in a Forth string.
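Because a string is just two numbers on the stack, you can inspect or adjust them with ordinary stack words. The following sketch assumes the usual Open Firmware words "nip", ".d" (display in decimal), and "/string" (remove characters from the front of a string); none of them has been introduced in this lesson.

 ok " this is a test" nip .d           \ discard adr, show len in decimal
 14
 ok " this is a test" drop 4 type      \ keep adr, supply a shorter len
 this
 ok " this is a test" 5 /string type   \ skip the first 5 characters
 is a test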

You can also use literal strings interactively:

 ok " this is a test" type
 this is a test

In standard Forth, the string literal word is s" . Open Firmware implements s" for compatibility, but the Open Firmware word " is more convenient, and the Open Firmware source uses it exclusively in preference to s" .

Embedding Control Characters in Strings

Open Firmware's " has a special escape syntax (not in standard Forth) that lets you put control characters and other binary data in strings. The syntax is carefully (and trickily) constructed so as not to conflict with standard usage.

 " hello"(12 3a 88 7f)test"r"n"

That creates a string containing "hello" followed by the characters 0x12, 0x3a, 0x88, 0x7f, then "test", then carriage return, linefeed.

Normally the second " would end the string, but in Open Firmware, a second " only ends the string if it is followed by whitespace. If non-whitespace follows the second " , substitute characters are inserted in the string as follows:

 After "         Replacement
 
 n               newline
 r               carriage return
 t               tab
 f               formfeed
 l               linefeed (same as newline)
 b               backspace
 !               bell
 "               quotation mark
 ^x              control x (x is a printable character)
 (hh hh hh ...)  Sequence of bytes specified in hex
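A few of the escapes from the table, tried interactively. This is only a sketch; ".d" (display in decimal) and "nip" are assumed to be available, as they are in typical Open Firmware systems.

 ok " abc"(44 45)" type                \ hex 44 and 45 are the letters D and E
 abcDE
 ok " say ""hi"" loudly" type          \ "" inserts one quotation mark
 say "hi" loudly
 ok " a"tb" nip .d                     \ a, tab, b: three bytes
 3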

String Storage

Forth doesn't do automatic string allocation or garbage collection, so it's up to you to manage the storage.

Literal strings that are compiled inside colon definitions live in the same memory space that contains the definitions, and should be treated as read-only.

Literal strings entered in interpret state (outside of a definition) are stored in one of two temporary buffers dedicated to that purpose. The system alternates between the two buffers so you can have two interpreted strings active at the same time.
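Having two buffers is what makes it possible to compare two interpreted strings directly. The sketch below uses "$=" from the operator list later in this lesson; "same?" is a hypothetical helper name. If you enter a third interpreted string while the first two are still on the stack, expect the first one's contents to be overwritten.

 ok : same?  ( $1 $2 -- )  $=  if  ." same"  else  ." different"  then  cr  ;   \ hypothetical helper
 ok " abc" " abc" same?
 same
 ok " abc" " abd" same?
 different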

You can allocate space for additional string storage as needed.

Some String Operators

Inside a stack diagram, "$" is often used as a shorthand notation to refer to an "adr len" string. Thus in the stack diagram "( path$ -- )", the number of arguments is 2, the address and length of a string denoting a path.

 type   ( adr len -- )                   Display string
 2dup   ( adr len -- adr len adr len )   Stack copy of string limits
 2drop  ( adr len -- )                   Discard string limits
 2over  ( $2 $1 -- $2 $1 $2 )            Stack copy of second string
 2tuck  ( $2 $1 -- $1 $2 $1 )            Tuck the string down in the stack
 comp   ( adr1 adr2 len -- diff )        Compare memory buffers. 0 if equal.
 $=     ( $1 $2 -- flag )                Compare strings.  True if equal.
 evaluate  ( $ -- )                      Interpret string as Forth code

Note that "2dup", "2drop", "2over", and "2tuck", while quite useful for manipulating string limits on the stack, are not really limited to use with strings. They operate on arbitrary pairs of numbers and are used in many other contexts.
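Here is a short interactive sketch of a few of these operators. It assumes ".d" (display in decimal), which is not part of the list above.

 ok " hello" 2dup type type            \ 2dup copies both adr and len
 hellohello
 ok " 3 4 + .d" evaluate               \ the string is interpreted as Forth code
 7
 ok " abc" drop " abc" drop 3 comp .d  \ compare 3 bytes at two addresses
 0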

Counted Strings

If you need to store a copy of a string in memory and are sure that the string is shorter than 256 bytes, you can use the space-efficient "counted string" format. In memory, a counted string consists of a length byte followed by that many data bytes.

 place  ( $ adr2 -- )           Save string at adr2 in counted form
 pack   ( $ adr2 -- adr2 )      Like place but returns adr2
 count  ( adr1 -- adr2 len2 )   Return adr,len of counted string
 $save  ( $1 adr2 -- $2 )       Save $1 at adr2 in counted form, returning
                                the address and length of the copy
 $cat   ( $ adr2 -- )           Append $ to the counted string at adr2

For "count", adr2 is always adr1+1, and len2 is the value of the byte at adr1, as a direct consequence of the way that the counted string format is defined.
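The following sketch builds up a counted string in two steps. The buffer name "name-buf" is made up for the example, and "buffer:" is described in the memory allocation section of this lesson. Note the two spaces after the second " : the first is the delimiter, the second is part of the string.

 ok d# 64 buffer: name-buf             \ name-buf is a hypothetical name
 ok " Open" name-buf place
 ok "  Firmware" name-buf $cat
 ok name-buf count type
 Open Firmware
 ok name-buf c@ .d                     \ the length byte itself
 13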

Memory Allocation for Strings

 alloc-mem  ( len -- adr )      Allocate memory (useful for strings)
 free-mem  ( adr len -- )       Free previously-allocated memory

"alloc-mem" and "free-mem" allocate anonymous memory from the heap. To create a named string buffer, use "buffer:".

 ok d# 100 buffer: my-string
 ok " This is a test" my-string place
 ok my-string count type
 This is a test

The stack argument is the number of bytes to allocate for the buffer. "buffer:" is not limited to counted strings; it can allocate named buffers of arbitrary size.

A "buffer:" buffer is allocated from the heap when it is first referenced, not when the buffer is defined.
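As a sketch of how "alloc-mem" and "free-mem" might be used together, the hypothetical word "copy-string" below makes a heap copy of a string so that it survives after the temporary interpret-state buffer is reused. It assumes the standard words "move", "2>r", and "2r>" are present, and it does not guard against zero-length strings.

 \ copy-string is a hypothetical helper, not a standard Open Firmware word
 : copy-string  ( adr len -- adr' len )
    dup alloc-mem        ( adr len adr' )
    swap 2dup 2>r        ( adr adr' len )  ( r: adr' len )
    move  2r>            ( adr' len )
 ;

 ok " keep me" copy-string
 ok 2dup type
 keep me
 ok free-mem

The caller owns the copy and must pass the same address and length back to "free-mem" when it is no longer needed.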

String Parsing

 sindex  ( $1 $2 -- n )
    Search for an occurrence of $1 inside $2.  If found, n is the offset
    within $2 where it was found.  If not found, n is -1.
 
 split-string       ( $1 delim -- tail$ head$ )
    Find the first occurrence of the character "delim" in $1.  If found, head$ is the
    portion of $1 up to but not including the delimiter and tail$ is the
    portion of $1 from the delimiter (inclusive) to the end.  If not
    found, head$ is $1 and tail$ is empty (i.e. its length is 0).
 
 left-parse-string  ( $1 delim -- tail$ head$ )
    Find the first occurrence of "delim" in $1.  If found, head$ is the
    portion of $1 up to but not including the delimiter and tail$ is the
    portion of $1 after the delimiter (not inclusive) to the end.  If not
    found, head$ is $1 and tail$ is empty (i.e. its length is 0).
      
 lex  ( $1 delim$ -- tail$ head$ delim true | $1 false )
    Find the first occurrence in $1 of any character in delim$ .  If found,
    head$ is the portion of $1 up to but not including the delimiter,
    tail$ is the portion of $1 after the delimiter (not inclusive) to the
    end, delim is the actual character found, and the top of the stack is
    true.  If not found, the original $1 is left unchanged and the top of
    the stack is false.
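
A brief sketch of these operators in use, following the stack diagrams given above. It assumes the Open Firmware word "ascii" for character literals and ".d" for decimal display; ".components" is a hypothetical word that displays each "/"-separated piece of a path on its own line.

 ok " is" " this is a test" sindex .d
 2
 ok " disk:1,boot" ascii : left-parse-string  type space type
 disk 1,boot

 \ .components is a hypothetical helper, not a standard Open Firmware word
 : .components  ( path$ -- )
    begin  dup  while                     \ loop until the remainder is empty
       ascii / left-parse-string          ( tail$ head$ )
       type cr                            ( tail$ )
    repeat
    2drop
 ;

 ok " dev/pci/usb" .components
 dev
 pci
 usb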

This is a sampling of some common general-purpose string operators. There are quite a few others.

Thus endeth the lesson.

Next Lesson: Open Firmware Device Trees