Forth Lesson 20

Mitch Bradley's Forth and
Open Firmware Lessons:

Open Firmware System Initialization

Initializing the system in preparation for running the OS is one of Open Firmware's main jobs. Modern computer chips have hundreds or even thousands of hardware registers that must be set up in specific sequences, so this job can be rather complex. Each CPU architecture has its own special requirements, and beyond that, each chipset and each peripheral component adds additional requirements. In general, an Open Firmware port to a given system divides the initialization into two main phases:

  • Low Level (Assembly Language) Init - setting up the core logic and getting the main memory working. Things initialized during this phase usually include system clock PLLs, CPU performance and feature control settings, bus bridges, pin multiplexors, and DRAM controllers. Modern chipsets often contain numerous device controllers, some of which are unconnected on a given board design. The registers that control which controllers are actually used and their usage modes must be set up in this phase.
  • High Level (Forth Language) Init - setup that can be deferred until after the main memory setup, and thus can be debugged using Forth debugging tools. Typically this includes user interface peripherals like keyboards and graphics displays, plus I/O peripherals like USB bridges, mass storage controllers, and network interfaces.

The choice of where to init certain hardware is made more complicated by the need to support resume-from-S3 (suspend to RAM) state on some systems. The wakeup code in the resume-from-S3 case is, up to a certain point, the same as the code that runs after power-on. When resuming from S3, the low-level init code exits directly to the OS's resume point, instead of proceeding to the Forth language init sequence. Therefore, any setup that must be done by the firmware during resume must be in the low-level part of the code.

Low Level (Assembly Language) Init

Early-Startup Challenges

The structure of the low-level init code depends a lot on the CPU and on the chipset. The code can be very tricky, because many of the processor features that you normally take for granted are not yet working. The processor is "limping along" in a restricted mode. The early-startup code must work its way out of the limping mode by carefully enabling the system features that allow it to run normally.

One challenge is that RAM is not on when the CPU first starts, so many ordinary coding practices like RAM variables and subroutine nesting using stacks don't work. To work around the lack of RAM, the early startup code must either use internal registers for all read/write information, or else "fake some RAM" with tricks like using unbacked cache as RAM.

The CPU usually runs much slower than normal during early startup. Fast internal CPU clocks are generated by PLL (Phase-Lock Loop) clock multipliers. Before those PLLs are configured for the desired frequency, turned on, and allowed to stabilize, the CPU must use a much-slower clock source. While the CPU is fetching instructions from ROM or FLASH (before it switches to running from cache or RAM), instruction fetch time will be much longer, because ROM and FLASH are slow compared to RAM and cache.

ARM Startup

The MMP2 SoC contains some internal masked ROM that the Security Processor core executes at reset. That code searches a variety of devices, one of which is SPI FLASH, looking for a TIMH signature (Trusted Image Module Header). Each such device is accessed with I/O device protocols, not via memory mapping. Having found the TIMH, the ROM code then decodes data structures telling it to perform various initialization steps and to load some more code from the device into static RAM inside the SoC, which the Security Processor then executes to go to the next step of booting.

In our case, we have set up the "TIM" data structures so the booting proceeds as follows:

WTMI

The "WTMI" (Wireless Trusted Module Image) field contains the "shim.img" code. Shim.img is a very short machine language program (compiled from C source in cforth/src/platform/arm-xo-1.75/{shimmain,spiread}.c). Its purpose is to read a larger image from SPI FLASH into SRAM. The only reason this layer is necessary is that the masked ROM SPI-reading code is buggy and fails if a subimage inside TIMH is larger than 4K (apparently, the ROM code for other possible boot sources like NAND FLASH and eMMC does not share this 4K-restriction bug). Thus shim.img must be smaller than 4K (it is 893 bytes at present). shim.img hardcodes the offset into SPI FLASH (0x2000), read size (0x18000), and destination address (0xd1000000, the beginning of SoC internal SRAM). The execution address of shim.img is 0xd1018000, in SRAM just after the area into which shim.img copies additional code (0xd1000000 .. 0xd1017fff).

In the Marvell boot flow, WTMI is either:

  1. some "trusted" code that provides security functions and wireless access routines that run on the Security Processor in a "locked down" fashion so the main CPU can't see or alter the code, only call it through a secure gateway mechanism, or
  2. a tiny stub that just releases the reset on the main CPU (said CPU is initially held in reset at startup), assuming that other fields have already initialized the DRAM and placed code for the main CPU therein.

OBMI

The "OBMI" (something Bootloader Module Image) field contains "dummy.img". In the boot flow that Marvell uses, OBMI is another intermediate bootloader layer (Marvell calls its code for this OBM) that does some more init and then loads U-Boot. In our boot flow, dummy.img is a minimal stub that does nothing, but satisfies the masked ROM's requirement (undocumented, empirically determined) that an OBMI field be present. Specifically, dummy.img consists of a single ARM "branch to self" instruction that is never executed.

CFTH

The "CFTH" field contains the CForth image. The masked ROM does not know what to do with that field so it ignores it, but the field is formatted according to the module header rules, so the masked ROM is happy. The data (the actual CForth image) begins at SPI FLASH offset 0x2000 (the same value that is hardcoded in shim.img) and must be no larger than 0x18000 (96K; the size hardcoded in shim.img). shim.img copies CForth from SPI FLASH into SRAM and jumps to it.
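The constants hardcoded in shim.img can be sanity-checked with a little arithmetic at any Forth prompt. The following is only an illustrative sketch: the word names are made up for this example and do not appear in the CForth sources.

```forth
hex
d1000000 constant sram-base      \ destination address hardcoded in shim.img
   18000 constant image-size     \ read size hardcoded in shim.img
    2000 constant flash-offset   \ SPI FLASH offset hardcoded in shim.img

\ Hypothetical helpers: first address/offset past the copied image
: sram-end   ( -- adr )     sram-base image-size + ;
: flash-end  ( -- offset )  flash-offset image-size + ;

sram-end u.    \ first SRAM address after the copy area
flash-end u.   \ first SPI FLASH offset after the CForth image
decimal
image-size 1024 / .   \ image size in units of 1024 bytes
```

The first result should be d1018000, which matches the execution address of shim.img. The second should be 1a000, and the last should be 96, matching the "96K" limit. (Some Forth systems display hex digits in upper case.)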

CForth begins by performing some basic setup in C, including enabling the clocks to a few SoC sub-devices and turning on the UARTs (cforth/src/platform/arm-xo-1.75/initio.c). It then starts to execute Forth code ("init" in cforth/src/app/arm-xo-1.75/app.fth). This is the first point where it would be possible to get an "ok" prompt, were app.fth modified to run "quit" before "init". "init" performs more SoC initialization, including more clock enabling, timer init, GPIO/pin-muxing setup, cranking up the clock frequency, turning on the DRAM, and fixing the boot fuses in case they were set wrong at the factory.

CForth then runs "ofw" (also defined in app.fth). "ofw" checks the rotate button and if it is held down, ofw exits to the interactive "ok" prompt. This is the first place where you can get an "ok" prompt (via serial) without modifying the code. If the rotate button is not pressed, ofw proceeds to read Open Firmware from SPI FLASH into DRAM. The portion of SPI FLASH beginning at offset 0x20000 is formatted in "Open Boot Dropin Module" style (as used with ".dropins") instead of in Marvell "TIMH" format. cforth/src/app/arm-xo-1.75/dropin.fth contains an abbreviated implementation of the Open Firmware dropin module scanner. It is used to locate the dropin module named "firmware" and read it into DRAM starting at address 0x2fa0.0000. That dropin module is stored in compressed form, so the code first reads the compressed data into DRAM at address 0x0900.0000, then decompresses it to address 0x2fa0.0000.

The Security Processor's memory map differs slightly from the main CPU's, so the SP address 0x2fa0.0000 corresponds to the main CPU's address 0x1fa0.0000.

Finally, CForth executes "ofw-go" (app.fth). ofw-go first stuffs a 4-instruction assembly language routine into DRAM at address 0, which is the reset vector for the main CPU. That routine has the effect of branching from 0 to the beginning of the OFW image at 0x1fa0.0000 . Then ofw-go releases the reset on the main CPU, which performs OFW startup. The final step after ofw-go (in ofw) is a "begin again" loop so the Security Processor will not interfere with OFW. (Ultimately, this will be changed either to something like "halt", or to some actively useful code like keyboard handling or perhaps even running OFW on the Security Processor.)

x86 Startup

16-bit Assembly Language Startup

x86 CPUs begin execution in 16-bit real mode, an instruction execution model dating back to the 8086 processor with its maximum memory size of 1 MByte. The 16-bit execution model is painful to use on modern systems with large address spaces, so the OFW early init code on x86 switches to 32-bit protected mode as soon as possible. The 16-bit real mode code is restricted to one file - rmstart.fth .

The XO-1.5 version is cpu/x86/pc/olpc/via/rmstart.fth. As with all x86 processors, it begins execution in 16-bit real mode from the x86 reset vector at address 0xfffffff0. It immediately switches to protected mode using a rudimentary Global Descriptor Table that is stored in FLASH, still using the 16-bit execution model from a 16-bit code segment. Then it does a "far jump" to a 32-bit code segment that is defined elsewhere (romreset.bth, as described below). The code in rmstart.fth is assembled by the Forth assembler, set to generate code using the 16-bit instruction set mode.

The XO-1 version, cpu/x86/pc/olpc/rmstart.fth, is a bit more complicated. It does all of the things described above, but must also do a couple of other things. First, the early-startup code must turn off the microphone LED as quickly as possible to prevent it from visibly flashing during startup (and especially during resume-from-S3, which uses the same early startup code). The LED-turnoff code must be executed almost immediately after reset, so it has to be in the 16-bit early sequence. Second, the Geode CPU chip's PLL turn-on mechanism resets the processor to resume execution at the faster clock speed, so the reset code up to the PLL turn-on step is executed twice. For fast startup, we do the PLL turn-on in 16-bit code, thus minimizing the number of instructions executed prior to the "go faster" point. After these two initial steps, the remaining 16-bit-mode steps are the same as described above.

For low-level hardware debugging, the 16-bit startup code emits values in the range 0x00 to 0x0f on I/O port 80 (the standard PC debug port). Look for "port80" in rmstart.fth to see the specific code values and their locations. (The code in rmstart.fth rarely fails if the CPU is able to execute instructions from FLASH; the only things that are likely to cause failure during this stage are very bad FLASH data corruption or a fundamental problem in the CPU chipset or its bus connections to the FLASH, preventing basic instruction execution from FLASH.)

32-bit Assembly Language Startup

The description below mostly focuses on the XO-1.5 case. The XO-1 assembly language startup is similar at an abstract level - the basic sequence of steps "init bus bridges, init DRAM, either start Forth or resume the OS from S3" is the same - but the details vary widely due to chipset differences.

This code is written using the Forth assembler in 32-bit instruction set mode, making heavy use of assembler macros to hide repetitive details of register usage, bit shuffling, and I/O port usage. For example, the macro expression "80 8fe5 config-wb" assembles a 6-instruction sequence that has the effect of writing the byte value "80" to the PCI configuration address "8fe5". The macros themselves are defined in cpu/x86/startmacros.fth (for generic x86 stuff) and cpu/x86/pc/olpc/via/startmacros.fth (for Via chipset-specific stuff). In the XO-1 startup case, the code uses the "set-msr" macro to set the Geode's many MSR configuration registers. The Via code uses "set-msr" in a few places, but most Via hardware registers are in PCI configuration space, so the Via code has lots of config-wb macros.
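To see what a configuration address such as "8fe5" refers to, it can be split into its fields. The sketch below assumes the conventional PCI layout (register offset in bits 0-7, function in bits 8-10, device in bits 11-15). The word names are invented for this illustration; they are not the macros from startmacros.fth.

```forth
decimal
: cfg>reg   ( cfg-adr -- reg )    255 and ;            \ bits 0-7
: cfg>func  ( cfg-adr -- func )   8 rshift  7 and ;    \ bits 8-10
: cfg>dev   ( cfg-adr -- dev )    11 rshift  31 and ;  \ bits 11-15

\ Inverse direction: base config address for a device/function pair
: devfunc>cfg  ( dev func -- cfg-adr )  8 lshift  swap 11 lshift  or ;

hex 8fe5 constant example-adr decimal

example-adr cfg>dev .    \ device number, in decimal
example-adr cfg>func .   \ function number
example-adr cfg>reg .    \ register offset, in decimal
17 0 devfunc>cfg  hex u. decimal   \ device 17, function 0, in hex
```

Under the stated assumption this should display 17, 7 and 229 (hex e5), then 8800. In other words, "80 8fe5 config-wb" would write to register e5 of device 17, function 7.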

The first instruction that is executed in 32-bit mode comes from the file cpu/x86/pc/olpc/via/romreset.bth (for XO-1.5), beginning just after "label startup". The first big step is to set up the cache so it can be used as RAM, thus making the CPU run faster and allowing the use of subroutine calls. That setup is done by including the file "startcacheasram.fth"; its contents are assembled in-line in the 32-bit startup sequence. Memory-Type Range Registers (MTRRs) are used to overlay a 32KB section of cache on top of the FLASH ROM at the current execution address. Another 32KB section of address space just below is also marked cacheable (with nothing "behind" it), for use as RAM for a subroutine-nesting stack. The cache does not "spill" into the nonexistent backing store because the subroutine nesting stack stays within the 32KB size. (The ability to directly overlay cache above FLASH is relatively unusual; many chipsets do not permit it. On such chipsets, loading code into cache requires more complicated operations.)

After startcacheasram.fth, it's possible to call subroutines. The code (back in romreset.bth) sets up a few key chipset registers, then calls the "cominit" subroutine (defined in "startcominit.fth") to turn on the serial port. The "cominit" code is surprisingly complicated because of complexities in the Via chipset. A bit that enables the UART is hidden in the graphics chip sequencer block, so you have to partially enable the graphics chip before you can turn on the UART! There are additional complications around the fact that we only want to turn on the UART if a serial dongle is connected to the internal UART connector. To make that determination, we must turn on several other well-hidden features, and must also issue a command to the Embedded Controller to distinguish old board revisions that don't support the dongle-detection hardware signal. Eventually the code enables the UART, and either connects it to the UART I/O pins or leaves it disconnected. (The reason for leaving it disconnected is that the UART I/O pins are shared with the video camera interface; you can't have both working at the same time. So we only connect the UART to those pins if a serial dongle is attached.)

startcominit.fth contains several instances of macro sequences like:

 d# 17 0 devfunc
 40 44 44 mreg
 43 0f 0b mreg
 59 ff 1c mreg
 ...
 end-table

That is a shorthand notation for a very common Via initialization procedure. Most Via hardware setup is done with PCI configuration registers. It's very common to do several consecutive byte-width configuration writes to the same device and function. Many of those configuration registers pack several bit fields into one byte, and a given init step needs to affect just one field, leaving the other fields in that byte untouched. The "devfunc ... mreg ... end-table" syntax optimizes that important case. "d# 17 0 devfunc" sets things up so the next sequence of "mreg" commands applies to device 17, function 0. "RR MM VV mreg" reads register "RR", masks off (clears) the bits "MM", sets the bits "VV", then writes back the modified value. So "MM" defines the bitfield to which the operation applies, while "VV" is the new value for that field. If "MM" is "ff", the entire byte is affected; in that case there is no need to read the value, so the code just writes "VV" to register "RR". This is implemented efficiently, by generating a compact table of byte values, then calling a subroutine to interpret that table. The interpreter time overhead is insignificant, because the subroutine code runs from cache, where instruction execution is very fast compared to the slow access time of the I/O ports that perform PCI configuration space accesses.
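The arithmetic performed for each "RR MM VV mreg" entry can be modeled at the Forth prompt. This is a sketch of the read-modify-write rule only: it does not touch hardware, the word name is hypothetical, and the starting register value a5 is arbitrary.

```forth
hex
\ Compute the byte that would be written back for one mreg entry.
: mreg-result  ( old mask value -- new )
   >r              \ set the new field value aside
   invert and      \ clear the bits selected by the mask
   r> or           \ insert the new field value
   ff and          \ configuration registers here are one byte wide
;

a5 44 44 mreg-result u.   \ entry "40 44 44": both bits of field 44 set
a5 ff 1c mreg-result u.   \ entry "59 ff 1c": mask ff, old value irrelevant
a5 f0 30 mreg-result u.   \ made-up entry: high nibble replaced, low kept
decimal
```

The three results should be e5, 1c and 35. Note that the second result does not depend on the old value at all, which is why the real code can skip the read when the mask is "ff".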

After cominit finishes, the startup code (back in romreset.bth) sends a '+' character to the serial port. If you have a serial dongle attached, this is the first indication that the system is running. The '+' occurs both in the power-up and resume-from-S3 cases.

The next main step is implemented by starthostctl.fth . It sets up more hardware registers, most notably ones from the "Host Bus Control" device (device 0 function 2). Many of those register values depend on several factors, including the (external) Host Bus clock frequency (100, 133, or 200 MHz), the DRAM frequency (200, 166, 333, or 400 MHz), and the DRAM bus width (32 or 64 bits). In some cases they must be determined experimentally. Fortunately, Via provides tables of values that are suitable for various cases, but as you can see, there is a lot of stuff to keep in mind when you are working on code like this.

After starthostctl.fth, there is some code in romreset.bth to force a full system reset in an obscure case that I don't really understand. That code was adapted from Via's coreboot code, where there was little explanation of its full meaning. It doesn't seem to hurt ...

The next step is demodram.fth, which sets up the registers in device 0 function 3 that control DRAM timing, then calls the DDRinit subroutine for each DRAM rank that is configured. The actual timing values for the board are written in cpu/x86/pc/olpc/via/dramtiming.fth . A human enters the timing numbers from the DRAM datasheet into dramtiming.fth, then the Forth assembler, at compile time, calculates the right numbers to put in the various hardware register fields. The DDRinit subroutine is defined in startdraminit.fth . It performs the sequence that writes to the DRAM chips' mode-setting registers, enabling the DRAM chips' DLLs and configuring their burst lengths and other settings. After calling DDRinit, there are some more setup steps for DRAM address multiplexing, clock gating, performance tuning, etc. demodram.fth sends numbers from 0x11 to 0x17 to port 80.
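The kind of compile-time calculation described above, turning a datasheet time into a count of clock cycles, can be sketched as follows. The word name, the use of picoseconds, and the example numbers are assumptions made for this illustration; the actual definitions in dramtiming.fth may be organized quite differently.

```forth
decimal
\ Round a time up to a whole number of clock periods.
\ Both arguments are in picoseconds so that integer math suffices.
: ps>clocks  ( time-ps period-ps -- clocks )
   tuck 1- +      \ add period-1 so that the division rounds up
   swap /
;

15000 2500 ps>clocks .   \ 15 ns with a 2.5 ns clock period
13125 2500 ps>clocks .   \ 13.125 ns with the same clock period
```

Both lines should display 6. Rounding up rather than down matters because datasheet timing parameters of this kind are minimum delays, so a register field must never encode fewer cycles than the datasheet time requires.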

The following steps send numbers from 0x18 to 0x1f to port 80.

startgfxinit.fth assigns a region from the top of main memory for use by the display frame buffer, telling both the main memory controller and the display controller where that "stolen" memory is located.

startmtrrinit.fth undoes the "cache as RAM" setup and restores a normal address map, with cache backed by DRAM instead of backed by nothing. That is now possible because the DRAM is now on.

ioinit.fth establishes low-level settings in numerous devices, configuring things like PCI subsystem IDs, clock gating, bus arbitration timers, number of SD slots, USB PHY settings, interrupt routing, and other chip-specific things that are outside of the standard programmer's model for standard devices.

startclkgen.fth reduces power consumption by talking to the clock generator chip over SMBUS, telling it to turn off clock outputs that are not connected to anything.

starttherm.fth configures the CPU chip's thermal monitor, to prevent overheating by automatically reducing the clock speed above 95 degrees C.

startcpuspeed.fth sets the CPU clock speed for maximum performance (the thermal monitor might turn down the speed if necessary).

Back in romreset.bth, we call "init-codec", a subroutine defined in dev/hdaudio/start-cx2058x.fth, to configure the audio CODEC. You might think that the audio CODEC setup could be deferred until later. It turns out that HDAudio CODECs contain some writable registers that describe the board-specific details of how their many ports are connected to various devices and external jacks. High-level software reads those registers to determine which ports to use for which functions. That board-specific setup must be done by the firmware so the high-level driver can be board-independent, and it must be done in both the power-up and resume-from-S3 cases. That is why we have to do it here, in early-startup code. init-codec also configures some internal CODEC chip settings for things like amplifier power and thermal, short-circuit, and GSMark protection. The subroutine includes a simple driver for the HDAudio controller, so it can send the necessary commands to the CODEC chip.

At this point, we are done with the code that is common between the power-on and resume-from-S3 cases. We now check the ACPI Status register to determine the wakeup type.

If it is an S3 wakeup, we output an 'r' character to the serial port, then establish some critical settings of display registers with startgfxrestore.fth. Ideally, the OS display driver would take care of saving and restoring all of the necessary state, but for some reason it does not, instead expecting the firmware to do some of the setup. startgfxrestore.fth sets the display registers to fixed values instead of saving on suspend and restoring on resume, because the firmware does not get a chance to run just before suspend, so it doesn't have a convenient opportunity to do a save step. In addition to the display register setting, startgfxrestore.fth also sets the DCON LOAD bit, thus getting an early start on restoring the screen image.

Also in the S3 case, if the OS is Windows, startpcirestore.fth performs some additional Windows-specific setup involving the SD Host Controller.

Then the S3 resume code jumps to the OS via the ACPI "Firmware Waking Vector". In the usual case of a 16-bit waking vector, the jump is tricky, involving a switch back to 16-bit mode, then a switch back to real mode, before finally transferring control to the OS in 16-bit real mode. The code that does that switching is defined in cpu/x86/pc/olpc/via/acpi.fth after "label do-acpi-wake". Late startup code copies that code sequence into memory below 1 MB (at the address given by "wake-adr") so that it can be executed in 16-bit mode during the resume from S3.

If the wakeup type is power-up (not S3), execution proceeds as below:

startgtlinit.fth sets a few more registers in device 0 function 2.

startmemtop.fth determines the address of the top of usable memory based on the memory size, the frame buffer size, and the amount of memory that is reserved for System Management Mode, storing that top address in a memory location near "mem-info-pa" for later use by board-independent startup code.
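The arithmetic behind that calculation can be sketched as below. This is only an illustration under stated assumptions: it assumes that both the frame buffer and the System Management Mode region are carved from the top of RAM, the word name is hypothetical, and the sizes (1 GiB of RAM, a 64 MiB frame buffer, 1 MiB for SMM) are made up rather than taken from startmemtop.fth.

```forth
hex
\ All sizes are in bytes; both reserved regions are assumed to sit
\ at the top of RAM (an assumption of this sketch).
: usable-top  ( mem-size fb-size smm-size -- top-adr )
   +  -           \ subtract the total reserved size from the memory size
;

40000000 4000000 100000 usable-top u.   \ made-up example sizes
decimal
```

With these example sizes the result should be 3bf00000.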

cpu/x86/pc/resetend.fth is the final step in the assembly language startup code. It is board-independent and common to all x86 Open Firmware implementations. It moves the Global Descriptor Table into memory (at the address given by gdt-pa), switches the CPU to use that new GDT location, turns on paged address translation (via the "paging" dropin module compiled from cpu/x86/pc/paging.bth), finds the "firmware" dropin module in FLASH, finds the "inflate" dropin module in FLASH, copies "inflate" to RAM, uses the RAM copy of "inflate" to decompress the "firmware" module from FLASH into RAM, then jumps to the decompressed RAM copy of "firmware". That "firmware" module contains the main Forth portion of Open Firmware. At this point, the assembly language startup sequence is finished. resetend.fth sends numbers from 0x20 to 0x2f to port 80.

High Level (Forth Language) Init

Control passes to the Forth portion of Open Firmware by jumping to the first location in the image (or, on some processors, to a small fixed offset from the first location). That first location typically contains a jump instruction to a short assembly language code sequence whose job is to set up a few basic Forth data structures. The name of that sequence is "prom-cold-code", defined in cpu/x86/pc/boot.fth . It sets up the Forth user area pointer, return stack pointer, data stack pointer, and dictionary pointer, then starts the Forth execution engine, causing it to run the top-level Forth word whose name is "cold".

"cold" runs four init sequences:

init-io
init-io executes the initialization chain word "stand-init-io" to perform the first phase of system-dependent initialization, including giving Forth's memory allocator some memory to work with, attaching a simple serial console driver for debugging, and a few other things that must be done very early.
do-init
do-init executes the initialization chain "init" to set up Forth internal data structures, such as allocating memory for string buffers, command line history, and the like. The very last action of "do-init" is the "Type 'i' to interrupt ..." interaction point, so you can use the Forth debugger to debug anything after that point.
init-environment
init-environment executes the initialization chain "stand-init" to perform later-stage system-dependent initialization, setting up any devices and Open Firmware data structures that have not already been handled.
cold-hook
cold-hook executes the ordinary (not chained) Forth definition "startup" to perform the user-visible part of Open Firmware startup, including turning on the screen, displaying a banner, probing for USB devices, running actions triggered by game keys, playing the startup jingle, and booting the OS in either secure or non-secure mode.

Initialization Chains

A general problem with initialization code is that the specific sequence for a given device is usually defined in a file containing that device's driver, but there must be a top-level procedure somewhere to call each of those sequences in the correct order. Maintaining that top-level procedure can be troublesome, as adding or deleting a driver requires a corresponding edit to the top-level init procedure. Open Firmware solves this problem with initialization chains. An initialization chain is a Forth word that calls another Forth word of the same name, then performs some initialization for its own device. This depends on the fact that Forth lets you have multiple distinct words with the same name. The most recently defined one is visible to the Forth interactive interpreter. So if you write, for example:

  ok : foo  ." Hello" cr  ;
  ok : foo  foo  ." World"  cr  ;
  ok foo

The result is:

  Hello
  World

The core Open Firmware system defines words named stand-init-io (for init-io), init (for do-init), and stand-init (for init-environment) as starting points for three initialization chains. A driver that needs to perform some initialization simply adds itself to the desired chain, depending on the stage at which its initialization must happen. Most drivers use the stand-init chain for late-stage initialization; a few use stand-init-io for early init. The middle init chain is rarely used by drivers. Here is how a driver can add to the stand-init-io chain:

  : stand-init-io   ( -- )
     stand-init-io
     my_custom_init_code_goes_here
  ;

The compiler will warn you of the duplicate name by saying "stand-init-io isn't unique". You can ignore that warning, because the duplicate name is intentional in this case.

Adding to stand-init could be done in a similar way, but there is a shorthand form because stand-init is so frequently used:

  stand-init: My Message
     my_custom_init_code_goes_here
  ;

The shorthand form has the additional behavior of compiling code to display the message on the serial port during execution of the init code, and also compiling code to send a number to port 80. The numbers start at 0x40 for the first word defined by "stand-init:", and increment from there. The compiler displays a list of those numbers and their corresponding messages. You can capture that output during compilation to make a debugging guide.
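The port 80 ranges mentioned throughout this lesson can be collected into a small lookup word, which may be handy when reading values off a debug card. This is a sketch: the word name is hypothetical, the ranges are simply those stated in the text for the XO-1.5, and the upper limit of the "stand-init:" range is an assumption (port 80 values are one byte wide).

```forth
hex
\ Map a port 80 value to the startup phase that emits it.
: phase-name  ( code -- c-addr u )
   dup 00 10 within  if  drop s" 16-bit startup (rmstart.fth)"  exit  then
   dup 11 18 within  if  drop s" DRAM setup (demodram.fth)"     exit  then
   dup 18 20 within  if  drop s" steps after demodram.fth"      exit  then
   dup 20 30 within  if  drop s" resetend.fth"                  exit  then
   dup 40 100 within if  drop s" stand-init: chain entry"       exit  then
   drop s" not covered in this lesson"
;

17 phase-name type cr
42 phase-name type cr
decimal
```

The first line should report the DRAM setup phase and the second a "stand-init:" chain entry. For values of 0x40 and above, the list that the compiler displays remains the authoritative guide to which init word was running.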

Within a chain, the individual init sequences will be executed in the order that the various incarnations of "stand-init" were defined. The top-level word is called first, but it immediately calls the previous version, which calls the previous version, down to the bottom of the chain. Then, as the various words return to their callers, each version's custom init code is executed before returning.
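That ordering can be demonstrated with a toy three-level chain. The name "demo-init" and the messages are hypothetical; a real driver would extend stand-init or stand-init-io instead.

```forth
\ Starting point of the chain
: demo-init  ( -- )  ." core " ;

\ Each later definition first calls the previous one, then does its own work
: demo-init  ( -- )  demo-init  ." driver-a " ;
: demo-init  ( -- )  demo-init  ." driver-b " ;

demo-init cr
```

This should display "core driver-a driver-b": the most recent definition is entered first, but its own message comes last because it calls the earlier versions before doing anything else. The system may warn that demo-init is not unique when the second and third definitions are compiled.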

Interactive Debugging of Initialization Problems

You can look at initialization chains with "see-chain":

  ok see-chain stand-init
  ok see-chain stand-init-io

You can debug the initialization sequence using the serial console. When the system says "Type 'i' to interrupt stand-init sequence", type 'i' quickly (another way is to start typing 'i' repeatedly as soon as you see "Forth"). That will get you an "ok" prompt right at the very end of the "do-init" phase. At that point, the "init-io" (AKA "stand-init-io") early chain has already executed, but the "init-environment" (AKA "stand-init") chain has not started.

If you suspect that the problem happens in the "stand-init" phase, type:

  ok debug stand-init
  ok resume

The debugger will trigger on the top-level "stand-init". You can use the debugger's 'd' keystroke to go down in the chain to deeper levels (earlier stand-init words), use 'u' to go up to higher (later) levels, use <space> to step through an individual level, and use 's' to see where you are.

To step through "startup", the word that does the final-stage user-visible startup, instead type:

  ok debug startup
  ok resume

You can't use the Forth debugger for problems that happen earlier than the "'i' to interrupt" point. That interrupt point is at the earliest point in the code where the debugger can work reliably. Fortunately, that is a reasonably early point in the overall sequence of things - if the serial port works and the CPU is capable of executing instructions from RAM, there is a good chance that the debugger will work.