Forth Lesson 20
Open Firmware System Initialization
Initializing the system in preparation for running the OS is one of Open Firmware's main jobs. Modern computer chips have hundreds or even thousands of hardware registers that must be set up in specific sequences, so this job can be rather complex. Each CPU architecture has its own special requirements, and beyond that, each chipset and each peripheral component adds additional requirements. In general, an Open Firmware port to a given system divides the initialization into two main phases:
- Low Level (Assembly Language) Init - setting up the core logic and getting the main memory working. Things initialized during this phase usually include system clock PLLs, CPU performance and feature control settings, bus bridges, pin multiplexors, and DRAM controllers. Modern chipsets often contain numerous device controllers, some of which are unconnected on a given board design. The registers that control which controllers are actually used and their usage modes must be set up in this phase.
- High Level (Forth Language) Init - setup that can be deferred until after the main memory setup, and thus can be debugged using Forth debugging tools. Typically this includes user interface peripherals like keyboards and graphics displays, plus I/O peripherals like USB bridges, mass storage controllers, and network interfaces.
The choice of where to init certain hardware is made more complicated by the need to support resume-from-S3 (suspend to RAM) state on some systems. The wakeup code in the resume-from-S3 case is, up to a certain point, the same as the code that runs after power-on. When resuming from S3, the low-level init code exits directly to the OS's resume point, instead of proceeding to the Forth language init sequence. Therefore, any setup that must be done by the firmware during resume must be in the low-level part of the code.
Low Level (Assembly Language) Init
Early-Startup Challenges
The structure of the low-level init code depends a lot on the CPU and on the chipset. The code can be very tricky, because many of the processor features that you normally take for granted are not yet working. The processor is "limping along" in a restricted mode. The early-startup code must work its way out of the limping mode by carefully enabling the system features that allow it to run normally.
One challenge is that RAM is not on when the CPU first starts, so many ordinary coding practices like RAM variables and subroutine nesting using stacks don't work. To work around the lack of RAM, the early startup code must either use internal registers for all read/write information, or else "fake some RAM" with tricks like using unbacked cache as RAM.
The CPU usually runs much slower than normal during early startup. Fast internal CPU clocks are generated by PLL (Phase-Lock Loop) clock multipliers. Before those PLLs are configured for the desired frequency, turned on, and allowed to stabilize, the CPU must use a much-slower clock source. While the CPU is fetching instructions from ROM or FLASH (before it switches to running from cache or RAM), instruction fetch time will be much longer, because ROM and FLASH are slow compared to RAM and cache.
x86 Startup
16-bit Assembly Language Startup
x86 CPUs begin execution in 16-bit real mode, an instruction execution model dating back to the 8086 processor with its maximum memory size of 1 MByte. The 16-bit execution model is painful to use on modern systems with large address spaces, so the OFW early init code on x86 switches to 32-bit protected mode as soon as possible. The 16-bit real mode code is restricted to one file - rmstart.fth .
The XO-1.5 version is cpu/x86/pc/olpc/via/rmstart.fth . As with all x86 processors, it begins execution in 16-bit real mode from the x86 reset vector at address 0xfffffff0 . It immediately switches to protected mode using a rudimentary Global Descriptor Table that is stored in FLASH, still using the 16-bit execution model from a 16-bit code segment. Then it does a "far jump" to a 32-bit code segment that is defined elsewhere (romreset.bth, as described below). The code in rmstart.fth is assembled by the Forth assembler, set to generate code using the 16-bit instruction set mode.
The XO-1 version, cpu/x86/pc/olpc/rmstart.fth , is a bit more complicated. It does all of the things described above, but must also do a couple of other things. First, the early-startup code must turn off the microphone LED as quickly as possible to prevent it from visibly flashing during startup (and especially during resume-from-S3, which uses the same early startup code). The LED-turnoff code must be executed almost immediately after reset, so it has to be in the 16-bit early sequence. Second, the Geode CPU chip's PLL turn-on mechanism resets the processor to resume execution at the faster clock speed, so the reset code up to the PLL turn-on step is executed twice. For fast startup, we do the PLL turn-on in 16-bit code, thus minimizing the number of instructions executed prior to the "go faster" point. After these two initial steps, the remaining 16-bit-mode steps are the same as described above.
For low-level hardware debugging, the 16-bit startup code emits values in the range 0x00 to 0x0f on I/O port 80 (the standard PC debug port). Look for "port80" in rmstart.fth to see the specific code values and their locations. (The code in rmstart.fth rarely fails if the CPU is able to execute instructions from FLASH; the only things that are likely to cause failure during this stage are very bad FLASH data corruption or a fundamental problem in the CPU chipset or its bus connections to the FLASH, preventing basic instruction execution from FLASH.)
32-bit Assembly Language Startup
The description below mostly focuses on the XO-1.5 case. The XO-1 assembly language startup is similar at an abstract level - the basic sequence of steps "init bus bridges, init DRAM, either start Forth or resume the OS from S3" is the same - but the details vary widely due to chipset differences.
This code is written using the Forth assembler in 32-bit instruction set mode, making heavy use of assembler macros to hide repetitive details of register usage, bit shuffling, and I/O port usage. For example, the macro expression "80 8fe5 config-wb" assembles a 6-instruction sequence that has the effect of writing the byte value "80" to the PCI configuration address "8fe5". The macros themselves are defined in cpu/x86/startmacros.fth (for generic x86 stuff) and cpu/x86/pc/olpc/via/startmacros.fth (for Via chipset-specific stuff). In the XO-1 startup case, the code uses the "set-msr" macro to set the Geode's many MSR configuration registers. The Via code uses "set-msr" in a few places, but most Via hardware registers are in PCI configuration space, so the Via code has lots of config-wb macros.
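As a rough illustration of what a configuration address like "8fe5" encodes, the sketch below splits it into its fields. It assumes the standard PCI configuration mechanism #1 layout (bus in bits 23:16, device in bits 15:11, function in bits 10:8, register in bits 7:0). The word names are hypothetical and are not taken from startmacros.fth.

```forth
hex
\ Field extractors for a PCI configuration address.
\ Hypothetical helper names, assuming the mechanism #1 bit layout.
: cfg>reg   ( cfg-adr -- reg )   ff and ;
: cfg>func  ( cfg-adr -- func )  8 rshift 7 and ;
: cfg>dev   ( cfg-adr -- dev )   0b rshift 1f and ;
: cfg>bus   ( cfg-adr -- bus )   10 rshift ff and ;

\ Display the fields (in the current number base).
: .cfg  ( cfg-adr -- )
   dup cfg>bus   ." bus "  .
   dup cfg>dev   ." dev "  .
   dup cfg>func  ." func " .
       cfg>reg   ." reg "  . cr ;

8fe5 .cfg      \ the address used in "80 8fe5 config-wb"
decimal
```

Under that assumption, 8fe5 refers to bus 0, device 0x11 (decimal 17), function 7, register 0xe5.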
The first instruction that is executed in 32-bit mode comes from the file cpu/x86/pc/olpc/via/romreset.bth (for XO-1.5), beginning just after "label startup". The first big step is to set up the cache so it can be used as RAM, thus making the CPU run faster and allowing the use of subroutine calls. That setup is done by including the file "startcacheasram.fth"; its contents are assembled in-line in the 32-bit startup sequence. Memory-Type Range Registers (MTRRs) are used to overlay a 32KB section of cache on top of the FLASH ROM at the current execution address. Another 32KB section of address space just below is also marked cacheable (with nothing "behind" it), for use as RAM for a subroutine-nesting stack. The cache does not "spill" into the nonexistent backing store because the subroutine nesting stack stays within the 32KB size. (The ability to directly overlay cache above FLASH is relatively unusual; many chipsets do not permit it. On such chipsets, loading code into cache requires more complicated operations.)
After startcacheasram.fth, it's possible to call subroutines. The code (back in romreset.bth) sets up a few key chipset registers, then calls the "cominit" subroutine (defined in "startcominit.fth") to turn on the serial port. The "cominit" code is surprisingly complicated because of complexities in the Via chipset. A bit that enables the UART is hidden in the graphics chip sequencer block, so you have to partially enable the graphics chip before you can turn on the UART! There are additional complications around the fact that we only want to turn on the UART if a serial dongle is connected to the internal UART connector. To make that determination, we must turn on several other well-hidden features, and must also issue a command to the Embedded Controller to distinguish old board revisions that don't support the dongle-detection hardware signal. Eventually the code enables the UART, and either connects it to the UART I/O pins or leaves it disconnected. (The reason for leaving it disconnected is that the UART I/O pins are shared with the video camera interface; you can't have both working at the same time. So we only connect the UART to those pins if a serial dongle is attached.)
startcominit.fth contains several instances of macro sequences like:
d# 17 0 devfunc
40 44 44 mreg
43 0f 0b mreg
59 ff 1c mreg
...
end-table
That is a shorthand notation for a very common Via initialization procedure. Most Via hardware setup is done with PCI configuration registers. It's very common to do several consecutive byte-width configuration writes to the same device and function. Many of those configuration registers pack several bit fields into one byte, and a given init step needs to affect just one field, leaving the other fields in that byte untouched. The "devfunc ... mreg ... end-table" syntax optimizes that important case. "d# 17 0 devfunc" sets things up so the next sequence of "mreg" commands apply to device 17, function 0. "RR MM VV mreg" reads register RR, masks off (clears) the bits MM, sets the bits "VV", then writes back the modified value. So "MM" defines the bitfield to which the operation applies, while "VV" is the new value for that field. If "MM" is "ff", the entire byte is affected; in that case there is no need to read the value, so the code just writes "VV" to register "RR". This is implemented efficiently, by generating a compact table of byte values, then calling a subroutine to interpret that table. The interpreter time overhead is insignificant, because the subroutine code runs from cache, where instruction execution is very fast compared to the slow access time of the I/O ports that perform PCI configuration space accesses.
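The read-modify-write rule for "RR MM VV mreg" can be tried out on an ordinary Forth system with a sketch like the following. It is only a model: a 256-byte array stands in for one device's configuration registers, and the three-bytes-per-entry table layout is an assumption made for illustration, not the actual format generated by the startup macros. All word names here are hypothetical.

```forth
hex
create cfg-space 100 allot          \ stand-in for one device's config registers
cfg-space 100 0 fill

: cfg-b@  ( reg -- b )  cfg-space + c@ ;
: cfg-b!  ( b reg -- )  cfg-space + c! ;

\ Apply one "RR MM VV" entry: clear the bits in MM, then set the bits in VV.
: apply-mreg  ( reg mask value -- )
   over ff = if                     \ whole byte affected: skip the read
      nip swap cfg-b!  exit
   then
   rot >r                           ( mask value )  ( r: reg )
   swap invert  r@ cfg-b@ and       ( value kept-bits )
   or  r> cfg-b! ;

\ A compact table of byte triples, as in the example above.
create demo-table
   40 c, 44 c, 44 c,
   43 c, 0f c, 0b c,
   59 c, ff c, 1c c,
here demo-table - constant /demo-table

\ Interpret the table three bytes at a time.
: run-table  ( adr len -- )
   over + swap ?do
      i c@  i 1+ c@  i 2 + c@  apply-mreg
   3 +loop ;

f0 43 cfg-b!                        \ pretend register 43 already held f0
demo-table /demo-table run-table    \ register 43 becomes fb: upper nibble kept
decimal
```

The point of the mask is visible in register 43: the upper four bits survive the update, while the lower four are replaced by the new field value.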
After cominit finishes, the startup code (back in romreset.bth) sends a '+' character to the serial port. If you have a serial dongle attached, this is the first indication that the system is running. The '+' occurs both in the power-up and resume-from-S3 cases.
The next main step is implemented by starthostctl.fth . It sets up more hardware registers, most notably ones from the "Host Bus Control" device (device 0 function 2).
After starthostctl.fth, there is some code to force a full system reset in an obscure case that I don't really understand. That code was adapted from Via's coreboot code, where there was little explanation of its full meaning. It doesn't seem to hurt ...
The next step is implemented by demodram.fth . It sets up the registers in device 0 function 3 that control DRAM timing, then calls the DDRinit subroutine for each DRAM rank that is configured. The actual timing values for the board are written in cpu/x86/pc/olpc/via/dramtiming.fth . A human enters the timing numbers from the DRAM datasheet into dramtiming.fth, then the Forth assembler, at compile time, calculates the right numbers to put in the various hardware register fields. The DDRinit subroutine is defined in startdraminit.fth . It performs the sequence that writes to the DRAM chips' mode-setting registers, enabling the DRAM chips' DLLs and configuring their burst lengths and other settings. After calling DDRinit, there are some more setup steps for DRAM address multiplexing, clock gating, performance tuning, etc. demodram.fth sends numbers from 0x11 to 0x17 to port 80.
The following steps send numbers from 0x18 to 0x1f to port 80:
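To give a feel for the kind of compile-time calculation involved, here is a minimal sketch that converts a datasheet time into a whole number of clock cycles, rounding up. It does not reproduce dramtiming.fth; the clock period, the 15 ns figure, and all word names are made-up illustrative assumptions.

```forth
decimal
3750 constant tck-ps     \ assumed clock period in picoseconds (illustrative only)

\ Round a time up to a whole number of clock cycles.
: ps>clocks  ( ps -- clocks )  tck-ps 1- +  tck-ps / ;
: ns>clocks  ( ns -- clocks )  1000 * ps>clocks ;

\ The conversion runs while the file is being loaded, so only the
\ final cycle count ends up in the compiled result.
15 ns>clocks constant demo-delay-clocks   \ 15 ns is an example, not a datasheet value
```

Rounding up matters here because a timing parameter is a minimum: using too few cycles would violate it, while one extra cycle only costs a little performance.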
startgfxinit.fth assigns a region from the top of main memory for use by the display frame buffer, telling both the main memory controller and the display controller where that "stolen" memory is located.
startmtrrinit.fth undoes the "cache as RAM" setup and restores a normal address map, with cache backed by DRAM instead of backed by nothing. That is now possible because the DRAM is now on.
ioinit.fth establishes low-level settings in numerous devices, configuring things like PCI subsystem IDs, clock gating, bus arbitration timers, number of SD slots, USB PHY settings, interrupt routing, and other chip-specific things that are outside of the standard programmer's model for standard devices.
startclkgen.fth reduces power consumption by talking to the clock generator chip over SMBUS, telling it to turn off clock outputs that are not connected to anything.
starttherm.fth configures the CPU chip's thermal monitor, to prevent overheating by automatically reducing the clock speed above 95 degrees C.
startcpuspeed.fth sets the CPU clock speed for maximum performance (the thermal monitor might turn down the speed if necessary).
Back in romreset.bth, we call "init-codec", a subroutine defined in dev/hdaudio/start-cx2058x.fth, to configure the audio CODEC. You might think that the audio CODEC setup could be deferred until later. It turns out that HDAudio CODECs contain some writable registers that describe the board-specific details of how their many ports are connected to various devices and external jacks. High-level software reads those registers to determine which ports to use for which functions. That board-specific setup must be done by the firmware so the high-level driver can be board-independent, and it must be done in both the power-up and resume-from-S3 cases. That is why we have to do it here, in early-startup code. init-codec also configures some internal CODEC chip settings for things like amplifier power and thermal, short-circuit, and GSMark protection. The subroutine includes a simple driver for the HDaudio controller, so it can send the necessary commands to the CODEC chip.
At this point, we are done with the code that is common between the power-on and resume-from-S3 cases. We now check the ACPI Status register to determine the wakeup type.
If it is an S3 wakeup, we output an 'r' character to the serial port, then establish some critical settings of display registers with startgfxrestore.fth . Ideally, the OS display driver would take care of saving and restoring all of the necessary state, but for some reason it does not, instead expecting the firmware to do some of the setup. startgfxrestore.fth sets the display registers to fixed values instead of saving on suspend and restoring on resume, because the firmware does not get a chance to run just before suspend, so it doesn't have a convenient opportunity to do a save step. In addition to the display register setting, startgfxrestore.fth also sets the DCON LOAD bit, thus getting an early start on restoring the screen image.
Also in the S3 case, if the OS is Windows, startpcirestore.fth performs some additional Windows-specific setup involving the SD Host Controller.
Then the S3 resume code jumps to the OS via the ACPI "Firmware Waking Vector". In the usual case of a 16-bit waking vector, the jump is tricky, involving a switch back to 16-bit mode, then a switch back to real mode, before finally transferring control to the OS in 16-bit real mode. The code that does that switching is defined in cpu/x86/pc/olpc/via/acpi.fth after "label do-acpi-wake" . Late startup code copies that code sequence into memory below 1 MB (at the address given by "wake-adr") so that it can be executed in 16-bit mode during the resume from S3.
If the wakeup type is power-up (not S3), execution proceeds as below:
startgtlinit.fth sets a few more registers in device 0 function 2.
startmemtop.fth determines the address of the top of usable memory based on the memory size, the frame buffer size, and the amount of memory that is reserved for System Management Mode, storing that top address in a memory location near "mem-info-pa" for later use by board-independent startup code.
cpu/x86/pc/resetend.fth is the final step in the assembly language startup code. It is board-independent and common to all x86 Open Firmware implementations. It moves the Global Descriptor Table into memory (at the address given by gdt-pa), switches the CPU to use that new GDT location, turns on paged address translation (via the "paging" dropin module compiled from cpu/x86/pc/paging.bth), finds the "firmware" dropin module in FLASH, finds the "inflate" dropin module in FLASH, copies "inflate" to RAM, uses the RAM copy of "inflate" to decompress the "firmware" module from FLASH into RAM, then jumps to the decompressed RAM copy of "firmware". That "firmware" module contains the main Forth portion of Open Firmware. At this point, the assembly language startup sequence is finished. resetend.fth sends numbers from 0x20 to 0x2f to port 80.
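The arithmetic behind the startmemtop.fth step described above can be sketched as follows. This assumes that both the frame buffer and the System Management Mode region are carved from the top of memory; the sizes and the word name are hypothetical, chosen only to make the subtraction concrete.

```forth
hex
\ Top of usable memory = total size minus the regions reserved at the top.
: usable-top  ( mem-size fb-size smm-size -- top )  + - ;

\ Example: 1 GiB of memory, 64 MiB frame buffer, 1 MiB SMM region
\ (illustrative sizes, not taken from any particular board).
40000000 4000000 100000 usable-top  u. cr
decimal
```

With those example sizes the result is 3bf00000 hex.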
High Level (Forth Language) Init
Control passes to the Forth portion of Open Firmware by jumping to the first location in the image (or, on some processors, to a small fixed offset from the first location). That first location typically contains a jump instruction to a short assembly language code sequence whose job is to set up a few basic Forth data structures. The name of that sequence is "prom-cold-code", defined in cpu/x86/pc/boot.fth . It sets up the Forth user area pointer, return stack pointer, data stack pointer, and dictionary pointer, then starts the Forth execution engine, causing it to run the top-level Forth word whose name is "cold".
"cold" runs four init sequences:
- init-io
- init-io executes the initialization chain word "stand-init-io" to perform the first phase of system-dependent initialization, including giving Forth's memory allocator some memory to work with, attaching a simple serial console driver for debugging, and a few other things that must be done very early.
- do-init
- do-init executes the initialization chain "init" to set up Forth internal data structures, such as allocating memory for string buffers, command line history, and the like. The very last action of "do-init" is the "Type 'i' to interrupt ..." interaction point, so you can use the Forth debugger to debug anything after that point.
- init-environment
- init-environment executes the initialization chain "stand-init" to perform later-stage system-dependent initialization, setting up any devices and Open Firmware data structures that have not already been handled.
- cold-hook
- cold-hook executes the ordinary (not chained) Forth definition "startup" to perform the user-visible part of Open Firmware startup, including turning on the screen, displaying a banner, probing for USB devices, running actions triggered by game keys, playing the startup jingle, and booting the OS in either secure or non-secure mode.
Initialization Chains
A general problem with initialization code is that the specific sequence for a given device is usually defined in a file containing that device's driver, but there must be a top-level procedure somewhere to call each of those sequences in the correct order. Maintaining that top-level procedure can be troublesome, as adding or deleting a driver requires a corresponding edit to the top-level init procedure. Open Firmware solves this problem with initialization chains. An initialization chain is a Forth word that calls another Forth word of the same name, then performs some initialization for its own device. This depends on the fact that Forth lets you have multiple distinct words with the same name. The most recently defined one is visible to the Forth interactive interpreter. So if you write, for example:
ok : foo ." Hello" cr ;
ok : foo foo ." World" cr ;
ok foo
The result is:
Hello
World
The core Open Firmware system defines words named stand-init-io (for init-io), init (for do-init), and stand-init (for init-environment) as starting points for three initialization chains. A driver that needs to perform some initialization simply adds itself to the desired chain, depending on the stage at which its initialization must happen. Most drivers use the stand-init chain for late-stage initialization; a few use stand-init-io for early init. The middle init chain is rarely used by drivers. Here is how a driver can add to the stand-init-io chain:
: stand-init-io ( -- )
   stand-init-io
   my_custom_init_code_goes_here
;
The compiler will warn you of the duplicate name by saying "stand-init-io isn't unique". You can ignore that warning, because the duplicate name is intentional in this case.
Adding to stand-init could be done in a similar way, but there is a shorthand form because stand-init is so frequently used:
stand-init: My Message
   my_custom_init_code_goes_here
;
The shorthand form has the additional behavior of compiling code to display the message on the serial port during execution of the init code, and also compiling code to send a number to port 80. The numbers start at 0x40 for the first word defined by "stand-init:" and increment from there. The compiler displays a list of those numbers and their corresponding messages. You can capture that output during compilation to make a debugging guide.
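When reading a port 80 display, it can help to have the code ranges mentioned in this lesson in one place. The following sketch is a small lookup word built only from those ranges. The word names are hypothetical, codes outside the listed ranges are reported as not described, and the upper limit of the "stand-init:" codes is left open because it depends on how many such words the build contains.

```forth
hex
: between?  ( n lo hi -- flag )  1+ within ;   \ inclusive range test

\ Map a port 80 code to the startup stage that emits it.
: stage$  ( code -- adr len )
   dup 00 0f between? if  drop s" rmstart.fth (16-bit startup)"  exit  then
   dup 11 17 between? if  drop s" demodram.fth (DRAM setup)"  exit  then
   dup 18 1f between? if  drop s" steps after demodram.fth"  exit  then
   dup 20 2f between? if  drop s" resetend.fth"  exit  then
   dup 3f > if  drop s" a stand-init: word (0x40 and up)"  exit  then
   drop s" not described in this lesson" ;

: .stage  ( code -- )  stage$ type cr ;

12 .stage      \ prints: demodram.fth (DRAM setup)
decimal
```

For the 0x40-and-up codes, the list that the compiler prints during the build remains the authoritative guide to which message belongs to which number.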
Within a chain, the individual init sequences will be executed in the order that the various incarnations of "stand-init" were defined. The top-level word is called first, but it immediately calls the previous version, which calls the previous version, down to the bottom of the chain. Then, as the various words return to their callers, each version's custom init code is executed before returning.
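The execution order can be checked on any standard Forth system with a three-link chain. This is a minimal sketch using hypothetical names ("demo-init", "log-step") rather than the real stand-init chain, so that it is safe to type at an Open Firmware prompt as well.

```forth
decimal
variable init-order   0 init-order !

\ Print a one-digit step number and append it to the log.
: log-step  ( n -- )  dup .  init-order @ 10 * +  init-order ! ;

: demo-init  ( -- )  1 log-step ;              \ bottom of the chain
: demo-init  ( -- )  demo-init  2 log-step ;   \ calls the previous demo-init first
: demo-init  ( -- )  demo-init  3 log-step ;   \ most recent definition, called by name

demo-init      \ prints 1 2 3 ; init-order now holds 123
```

The most recent definition is the one that gets called, yet its own step runs last, because each link defers to the older definition before doing its own work. The system may report that the name is not unique or has been redefined; as noted above, that is expected.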
Interactive Debugging of Initialization Problems
You can look at initialization chains with "see-chain":
ok see-chain stand-init
ok see-chain stand-init-io
You can debug the initialization sequence using the serial console. When the system says "Type 'i' to interrupt stand-init sequence", type 'i' quickly (another way is to start typing 'i' repeatedly as soon as you see "Forth"). That will get you an "ok" prompt right at the very end of the "do-init" phase. At that point, the "init-io" (AKA "stand-init-io") early chain has already executed, but the "init-environment" (AKA "stand-init") chain has not started.
If you suspect that the problem happens in the "stand-init" phase, type:
ok debug stand-init
ok resume
The debugger will trigger on the top-level "stand-init". You can use the debugger's 'd' keystroke to go down in the chain to deeper levels (earlier stand-init words), use 'u' to go up to higher (later) levels, use <space> to step through an individual level, and use 's' to see where you are.
To step through "startup" - the word that does the final-stage user-visible startup, instead type:
ok debug startup
ok resume
You can't use the Forth debugger for problems that happen earlier than the "'i' to interrupt" point. That interrupt point is at the earliest point in the code where the debugger can work reliably. Fortunately, that is a reasonably early point in the overall sequence of things - if the serial port works and the CPU is capable of executing instructions from RAM, there is a good chance that the debugger will work.