Forth Lesson 8
Revision as of 06:15, 15 February 2013
Review
In the previous lesson, we learned:
- The conditional control structure "if ... [ else ... ] then"
- Various conditional looping structures like "begin ... until"
- The comparison operators
- "do ... loop" and its variations and wrinkles
- That the compiler has an extension mechanism
- How control structures are implemented using that mechanism
Strings
As in C, strings are not a first-class data type in Forth. Forth has no support for automatic string allocation and garbage collection. This is in keeping with Forth's roots in real-time control on resource-limited systems. But as with C, you can do what you need to do, even if it isn't especially convenient.
Displaying Strings
ok : hello ." Hello, world" cr ;
ok hello
Hello, world
Finally we have stated the Forth version of the canonical first program! ." (dot-quote) displays a string delimited by the next '"' .
We didn't really have to make a definition:
ok ." Hello, world" cr
Hello, world
Technically, in standard Forth, you are supposed to use .( instead of ." outside of a definition (for tedious reasons), as in:
ok .( Hello, world) cr
Hello, world
but Open Firmware lets you use either form.
"cr" sends a newline sequence to the output device. Contrast this to the C practice of embedding the newline as an escape sequence inside the string and letting the output device driver transform it into the appropriate sequence. Both approaches have advantages and disadvantages...
Note that, as with everything else, ." is just a Forth word, and thus must be whitespace delimited. Its behavior is to parse a string delimited by " from the input stream and display it. ." is one of those "immediate" words that we talked about in the last lesson. When ." is encountered in compilation state, it executes immediately, parses out the string at compile time, then stores that string in the new definition as a literal, with some code to display it when the new definition later executes.
Literal Strings
If you just want a literal string for use later, i.e. you don't want to display it, use " (quote) instead of ." (dot-quote).
ok : my-string ( -- adr len ) " this is a test" ;
ok my-string type
this is a test
The stack representation of a string is ( adr len ) - the address of the first character and the number of characters. Characters are just bytes. Only ASCII support is guaranteed, although Open Firmware typically has an ISO-Latin-1 font. The length includes just the characters in the string. There is no explicit notion in Forth of a null-terminated string like in C, so you could put binary data including 0 bytes in a Forth string.
You can also use literal strings interactively:
ok " this is a test" type
this is a test
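To make the ( adr len ) representation concrete, here is a small sketch that picks the pair apart. It assumes the standard words "nip", "drop", "c@" and "emit", which are not introduced in this lesson.

```forth
\ A string on the stack is just two numbers: address, then length.
" hello" nip .              \ discard the address; displays 5, the length
" hello" drop c@ emit cr    \ discard the length; displays h, the first byte
```

Because the length travels with the address, nothing in the string itself marks where it ends.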
In standard Forth, the string literal word is s" . Open Firmware implements s" for compatibility, but the Open Firmware word " is more convenient, and the Open Firmware source uses it exclusively in preference to s" .
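As a quick illustration, the two spellings can be used interchangeably in Open Firmware. The names "greeting-1" and "greeting-2" are made up for this sketch.

```forth
\ The same literal written both ways; each word leaves ( adr len ).
: greeting-1  ( -- adr len )  " hello"   ;   \ Open Firmware spelling
: greeting-2  ( -- adr len )  s" hello"  ;   \ standard Forth spelling

greeting-1 type cr    \ displays hello
greeting-2 type cr    \ displays hello
```

Only the second definition would be expected to compile on a standard Forth system that lacks the Open Firmware " word.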
Embedding Control Characters in Strings
Open Firmware's " has a special escape syntax (not in standard Forth) that lets you put control characters and other binary data in strings. The syntax is carefully (and trickily) constructed so as not to conflict with standard usage.
" hello"(12 3a 88 7f)test"r"n"
That creates a string containing "hello" followed by the characters 0x12, 0x3a, 0x88, 0x7f, then "test", then carriage return, linefeed.
Normally the second " would end the string, but in Open Firmware, a second " only ends the string if it is followed by whitespace. If non-whitespace follows the second " , substitute characters are inserted in the string as follows:
After "          Replacement
n                newline
r                carriage return
t                tab
f                formfeed
l                linefeed (same as newline)
b                backspace
!                bell
"                quotation mark
^x               control x (x is a printable character)
(hh hh hh ...)   Sequence of bytes specified in hex
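A few short experiments with the escape syntax, as a sketch based on the table above. The expected results follow from the rules as stated there, so treat them as what should happen rather than as captured output.

```forth
\ Hex bytes in parentheses: 43 and 44 are the ASCII codes for C and D.
" AB"(43 44)" type cr       \ should display ABCD

\ "t inserts a single tab character, so this string is 3 bytes long.
" a"tb" nip .               \ should display 3

\ Two quotes in a row insert one literal quotation mark.
" say ""hi""" type cr       \ should display say "hi"
```

In each case the string ends at the first " that is followed by whitespace.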
String Storage
Forth doesn't do automatic string allocation or garbage collection, so it's up to you to manage the storage.
Literal strings that are compiled inside colon definitions live in the same memory space that contains the definitions, and should be treated as read-only.
Literal strings entered in interpret state (outside of a definition) are stored in one of two temporary buffers dedicated to that purpose. The system alternates between the two buffers so you can have two interpreted strings active at the same time.
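The following sketch shows why the pair of temporary buffers matters, and how to keep a string once more literals have been typed. The buffer name "kept" is hypothetical, and the example borrows "$=", "place", "count" and "buffer:" from later sections of this lesson.

```forth
\ Two interpreted literals can be live at once, one per temporary buffer.
" apple" " apple" $= .      \ displays -1, the true flag

\ Later literals reuse the same two buffers, so copy what you need to keep.
d# 32 buffer: kept
" keep me" kept place       \ save a private copy in counted form
" other" 2drop
" another" 2drop            \ the temporary buffers have been overwritten
kept count type cr          \ displays keep me
```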
You can allocate space for additional string storage as needed.
Some String Operators
Inside a stack diagram, "$" is often used as a shorthand notation to refer to an "adr len" string. Thus in the stack diagram "( path$ -- )", the number of arguments is 2, the address and length of a string denoting a path.
type      ( adr len -- )                  Display string
2dup      ( adr len -- adr len adr len )  Stack copy of string limits
2drop     ( adr len -- )                  Discard string limits
2over     ( $2 $1 -- $2 $1 $2 )           Stack copy of second string
2tuck     ( $2 $1 -- $1 $2 $1 )           Tuck the string down in the stack
comp      ( adr1 adr2 len -- diff )       Compare memory buffers. 0 if equal.
$=        ( $1 $2 -- flag )               Compare strings. True if equal.
evaluate  ( $ -- )                        Interpret string as Forth code
Note that "2dup", "2drop", "2over", and "2tuck", while quite useful for manipulating string limits on the stack, are not really limited to use with strings. They operate on arbitrary pairs of numbers and are used in many other contexts.
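Here is a brief sketch that exercises a few of these operators. The word "show-twice" is invented for the example, and "space" is assumed from standard Forth.

```forth
\ 2dup copies the ( adr len ) pair so the string can be used twice.
: show-twice  ( adr len -- )  2dup type space type cr  ;
" abc" show-twice           \ displays abc abc

\ $= compares two strings and leaves a flag.
" abc" " abd" $= .          \ displays 0 (false)

\ evaluate interprets the string as if it had been typed at the prompt.
" 3 4 +" evaluate .         \ displays 7
```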
Counted Strings
If you need to store a copy of a string in memory and are sure that the string is shorter than 256 bytes, you can use the space-efficient "counted string" format. In memory, a counted string consists of a length byte followed by n data bytes.
place  ( $ adr2 -- )          Save string at adr2 in counted form
pack   ( $ adr2 -- adr2 )     Like place but returns adr2
count  ( adr1 -- adr2 len2 )  Return adr,len of counted string
$save  ( $ adr2 -- $2 )       Save $ at adr2 in counted form, returning the address and length of the copy
$cat   ( $ adr2 -- )          Append $ to the counted string at adr2
For "count", adr2 is always adr1+1, and len2 is the value of the byte at adr1, as a direct consequence of the way that the counted string format is defined.
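The sketch below builds a counted string and then reads it back two ways, to show that "count" does nothing more than fetch the length byte and step past it. The buffer "name-buf" is hypothetical, and "create", "allot", "c@" and "1+" are assumed from standard Forth.

```forth
create name-buf  d# 32 allot          \ hypothetical 32-byte buffer

" Open" name-buf place                \ length byte 4, then O p e n
name-buf c@ .                         \ displays 4, the length byte

" Firmware" name-buf $cat             \ append to the counted string
name-buf count type cr                \ displays OpenFirmware

\ The same thing by hand: data starts at adr+1, length is the byte at adr.
name-buf 1+  name-buf c@  type cr     \ displays OpenFirmware
```

The buffer must be large enough for the length byte plus all of the data; neither "place" nor "$cat" is given the buffer size to check against.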
Memory Allocation for Strings
alloc-mem  ( len -- adr )  Allocate memory (useful for strings)
free-mem   ( adr len -- )  Free previously-allocated memory
Alloc-mem and free-mem allocate anonymous memory from the heap. To create a named string buffer, use "buffer:".
ok d# 100 buffer: my-string
ok " This is a test" my-string place
ok my-string count type
This is a test
The stack argument is the number of bytes to allocate for the buffer. "buffer:" is not limited to counted strings; it can allocate named buffers of arbitrary size.
The memory for a "buffer:" is allocated from the heap when the buffer is first referenced, not when the buffer is defined.
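For comparison with "buffer:", here is a minimal sketch of the anonymous route using "alloc-mem" and "free-mem". The name "scratch" and the size of 64 bytes are arbitrary choices for the example, and "constant" is assumed from earlier lessons.

```forth
d# 64 alloc-mem constant scratch      \ remember the address of a 64-byte block

" temporary" scratch place            \ use it as a counted-string buffer
scratch count type cr                 \ displays temporary

scratch d# 64 free-mem                \ give the block back, with the same length
```

Note that "free-mem" takes the length as well as the address, so the caller has to remember how much was allocated.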
String Parsing
sindex ( $1 $2 -- n )
  Search for an occurrence of $1 inside $2. If found, n is the offset within $2 where it was found. If not found, n is -1.

split-string ( $1 delim -- tail$ head$ )
  Find the first occurrence of the character "delim" in $1. If found, head$ is the portion of $1 up to but not including the delimiter and tail$ is the portion of $1 from the delimiter (inclusive) to the end. If not found, head$ is $1 and tail$ is empty (i.e. its length is 0).

left-parse-string ( $1 delim -- tail$ head$ )
  Find the first occurrence of "delim" in $1. If found, head$ is the portion of $1 up to but not including the delimiter and tail$ is the portion of $1 after the delimiter (not inclusive) to the end. If not found, head$ is $1 and tail$ is empty (i.e. its length is 0).

lex ( $1 delim$ -- tail$ head$ delim true | $1 false )
  Find the first occurrence in $1 of any character in delim$ . If found, head$ is the portion of $1 up to but not including the delimiter, tail$ is the portion of $1 after the delimiter (not inclusive) to the end, delim is the actual character found, and the top of the stack is true. If not found, the original $1 is left on the stack unchanged and the top of the stack is false.
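As a sketch of how these words are typically used, the definition below prints each comma-separated field of a string on its own line. It assumes that "left-parse-string" leaves the head string on top of the stack with the tail beneath it, which is the result order given in IEEE 1275. The word "show-fields" is hypothetical, and "[char]" and "begin ... while ... repeat" are assumed from standard Forth.

```forth
: show-fields  ( adr len -- )
   begin  dup  while                  \ loop while the remaining length is nonzero
      [char] , left-parse-string      ( tail$ head$ )
      type cr                         \ display the head, keep the tail
   repeat
   2drop                              \ discard the empty remainder
;

" red,green,blue" show-fields
\ displays:
\   red
\   green
\   blue

\ sindex reports where one string occurs inside another.
" ware" " firmware" sindex .          \ displays 4
```

When no delimiter remains, the whole remainder comes back as the head and the tail is empty, which is what ends the loop.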
This is a sampling of some common general-purpose string operators. There are quite a few others.
Thus endeth the lesson.