Geode LX

From OLPC
Revision as of 01:54, 15 January 2008 by LeeSI (talk | contribs) (Geode moved to Geode LX: clarifying Geode GX/LX/NX because they have different spec. each.)

This page is about the XO's CPU, an AMD Geode LX.

The cache and TLB information is useful for general high-level optimization. The rest is more useful for writing in assembly, writing compilers, profiling, and possibly debugging.

Instruction set

The Geode LX is very similar to the original Athlon. See the Geode instruction set page for details. You can use -march=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode (on x86, gcc treats -mcpu only as a deprecated synonym for -mtune). If your gcc does not support this, "-Os -march=pentiumpro -mtune=generic" would be a decent choice.

Cache and TLB

The instruction and data L1 caches are both 64 KB, 16-way set associative, with a 32-byte line size (page 242). The data cache is write-back.

The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache.

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative.

Processing bulk data (images, video, audio, etc.) in chunks of 64 KB would thus be a decent optimization strategy, remembering that all data (both inputs and outputs) ought to fit within the 64 KB. For example, a 128x128 chunk of data is 64 KB if it consists of 32-bit values such as sRGB pixels, floats, or pointers.
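As a rough illustration of this chunking strategy, here is a minimal sketch in C. It assumes the data is processed in place (so input and output share the same 64 KB) and that several passes are run over it. The names run_passes_chunked, invert_rgb and add_one are made up for this example. In real code a somewhat smaller chunk may work better, because the stack and other data also compete for the L1 cache.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the Geode LX L1 data cache, taken from the text above. */
#define L1_DATA_BYTES (64u * 1024u)

/* A pass transforms `count` 32-bit values in place. */
typedef void (*pass_fn)(uint32_t *px, size_t count);

/* Example pass (hypothetical): invert the low 24 bits of each value. */
void invert_rgb(uint32_t *px, size_t count)
{
    for (size_t i = 0; i < count; i++)
        px[i] ^= 0x00FFFFFFu;
}

/* Example pass (hypothetical): add 1 to each value, wrapping on overflow. */
void add_one(uint32_t *px, size_t count)
{
    for (size_t i = 0; i < count; i++)
        px[i] += 1u;
}

/*
 * Run every pass over one cache-sized chunk before moving on to the next
 * chunk, so that the second and later passes find their data in L1.
 */
void run_passes_chunked(uint32_t *px, size_t count,
                        const pass_fn *passes, size_t npasses)
{
    const size_t chunk = L1_DATA_BYTES / sizeof(uint32_t); /* 16384 values */

    for (size_t start = 0; start < count; start += chunk) {
        size_t len = count - start;
        if (len > chunk)
            len = chunk;
        for (size_t p = 0; p < npasses; p++)
            passes[p](px + start, len);
    }
}
```

Because each pass here only touches one element at a time, the chunked version gives the same result as running each pass over the whole buffer; passes that look at neighbouring elements would need overlapping chunks.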

MSRs are provided for observing cache line age and TLB entry age.

The LX can execute at most 3 prefetches concurrently (see the remark "This reflects the internal limit of the LX of 3 outstanding prefetch transactions").

Extra debug registers

In addition to the normal x86 debug registers, the Geode has a few extra ones. These allow for a fifth data breakpoint and an opcode breakpoint. The opcode breakpoint is particularly interesting, as it lets a debugger break on the execution of a particular instruction. The ptrace() system call does not currently support these extra debug registers, and would need a patch. It is likely that one would want to patch gdb as well.

FPU

One should mask all exceptions and/or use imprecise exceptions to get full speed. Full speed can be ensured by writing a 1 to MSR 0x1a00, known as FP_MODE_MSR.
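On Linux, MSRs can be written from user space through the device nodes of the msr driver (for example /dev/cpu/0/msr), where the file offset selects the MSR and each access is 8 bytes. The following is a minimal sketch under the assumptions that this driver is loaded and that the caller is root. The helper name write_msr is made up for this example. Note that it overwrites the whole 64-bit register: if FP_MODE_MSR has other meaningful bits, a read-modify-write would be safer, so check the databook first.

```c
#define _POSIX_C_SOURCE 200809L /* for pwrite() */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* MSR number given in the text above. */
#define FP_MODE_MSR 0x1a00u

/*
 * Write a 64-bit value to MSR number `msr` through an msr device node
 * such as /dev/cpu/0/msr. Returns 0 on success, -1 on failure.
 */
int write_msr(const char *dev_path, uint32_t msr, uint64_t value)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* The msr driver uses the file offset as the MSR number. */
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)msr);
    close(fd);

    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

With this helper, the write described above would be write_msr("/dev/cpu/0/msr", FP_MODE_MSR, 1).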

Test results

The numbers shown here are calculated from test program runs:

Test1 Results

Test2 Results

Test3 Results

Architecture

The Geode contains an in-order pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction will take at least as many clock cycles as the LX databook lists for it (so the databook numbers are throughputs in Intel slang).

The IU can schedule at most 1 instruction per clock cycle. If it encounters an MMX instruction, it puts it into the MMX queue of the FPU, which is 6 instructions deep. If the queue is full, the IU stalls.

TBD: FPU queue, 3DNOW queue.

TBD: FP cycles, FP 2-cycle wait state, 3DNOW cycles, whether 3DNOW has the 2-cycle wait state, can it be more than 6 cycles?

TBDWGRM: FPU pipeline details

The IU has a branch predictor of unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a stall in the pipeline, which can probably be eliminated in certain cases; TBDWGRM, but not too important). An unpredicted branch costs a penalty of 5-8 cycles, depending on the pipeline state and current cosmic events (TBDWGRM).

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

MMX

The FPU is an out-of-order execution unit which processes MMX instructions as well. The information about the FPU in the databook is not only sparse but in places misleading and sometimes totally wrong! Out-of-order here means that, for example, PADDB takes 6 cycles to execute (this latency is not listed in the databook), while the 2 cycles listed in the databook are its throughput in the undocumented FPU pipeline. Since the FPU runs asynchronously with the IU, one can schedule an MMX op in every cycle until the MMX queue becomes full.

There are two types of MMX instructions:

1. Normal ops. These use MMX regs or memory operands.

TBD: test real clock cycles and memory access

2. Synchronized ops (I invented this name for them). These use 32-bit regs.

2/a. Synchronized Loads (SL). These load a 32-bit reg into an MMX reg.

It seems that these ops can only be scheduled when no dependent ops and no other SL ops are running; otherwise they stall the IU (see test 301a-j). MOVD takes 5 cycles to schedule and 0 to execute. PINSRW takes 5 cycles to schedule and 4 to execute. The IU can schedule independent normal MMX ops after these ops.

2/b. Synchronized Stores (SS). These store part of an MMX reg into a 32-bit reg.

These ops stall the IU until their source values have been calculated in the MMX queue, and once the source is accessible they take 10 (TEN!) cycles to execute. MOVD and PEXTRW have been tested; PMOVMSKB has not. It seems that the MMX queue keeps running while such an op is executing.

TBD: test PMOVMSKB? or is it useless enough?

In practice it means that:

  1. You really do not want to use the result of an MMX op in integer ops. In other words, if you want to use MMX because there are too few integer regs, or because there is a fancy MMX op which does something useful, then forget it!
  2. You really want to use MMX only on streaming data (a large enough buffer). If you really need the results in 32-bit regs, process a large enough buffer with MMX first, then in a second pass load the results from memory into 32-bit regs and use them there.
  3. Keep in mind that certain algorithms can be faster if you implement them with integer instructions instead of MMX.
  4. You have to interleave MMX and integer instructions to fully utilize the Geode. You can do this by scheduling up to 6 MMX instructions before the integer ones, or by simply interleaving them.
  5. You have to create at least 3 independent dependency chains.
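
Points 1, 2 and 5 can be sketched in C as follows. This is only an illustration, assuming GCC-style MMX intrinsics from mmintrin.h (with a plain C fallback when MMX is not available). The function name checksum8 is made up for this example and is not one of the test programs. It keeps three PADDB accumulators that never depend on each other (a 6-cycle latency divided by a 2-cycle throughput suggests 3 chains, if the numbers above hold) and hands the result to the integer code through memory rather than with a MOVD into a 32-bit reg. The compiler is free to schedule this differently, so check the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: sum of all bytes in buf[0..n), modulo 256. */
uint8_t checksum8(const uint8_t *buf, size_t n)
{
    uint8_t lanes[8] = {0}; /* MMX result is passed on through memory */
    size_t i = 0;

#if defined(__MMX__)
    /* Three accumulators, i.e. three independent dependency chains. */
    __m64 acc0 = _mm_setzero_si64();
    __m64 acc1 = _mm_setzero_si64();
    __m64 acc2 = _mm_setzero_si64();

    for (; i + 24 <= n; i += 24) {
        __m64 v0, v1, v2;
        memcpy(&v0, buf + i, 8);
        memcpy(&v1, buf + i + 8, 8);
        memcpy(&v2, buf + i + 16, 8);
        acc0 = _mm_add_pi8(acc0, v0); /* PADDB, chain 0 */
        acc1 = _mm_add_pi8(acc1, v1); /* PADDB, chain 1 */
        acc2 = _mm_add_pi8(acc2, v2); /* PADDB, chain 2 */
    }

    /* Merge the chains once, then store to memory instead of using MOVD. */
    acc0 = _mm_add_pi8(acc0, _mm_add_pi8(acc1, acc2));
    memcpy(lanes, &acc0, 8);
    _mm_empty(); /* EMMS: leave MMX state before any FPU code runs */
#endif

    /* Integer pass: reads the MMX result from memory, plus leftover bytes. */
    uint8_t sum = 0;
    for (int k = 0; k < 8; k++)
        sum = (uint8_t)(sum + lanes[k]);
    for (; i < n; i++)
        sum = (uint8_t)(sum + buf[i]);
    return sum;
}
```

For a short buffer like the one a single call would see here, the final integer pass follows the MMX pass almost immediately; in a real streaming loop the MMX results would be written to an output buffer and only read back by integer code in a later pass.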

Notes

TBD: To Be Done

TBDWGRM: To Be Done When Get a Real Machine

See also