Geode LX
Geode optimization guide
This page is about the AMD Geode LX processor, which is the CPU of the XO. If you are not optimizing programs in assembly, you can stop reading here. If you are an AMD engineer, feel free to correct any mistakes you find.
TBD: hardware data, links
The numbers shown here are calculated from test program runs.
Architecture
The Geode contains an in-order pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction takes at least as many clock cycles as the LX databook lists for it (in Intel slang, the databook numbers are throughputs).
The IU can issue at most 1 instruction per clock cycle. When it encounters an MMX instruction, it puts it into the MMX queue of the FPU, which is 6 instructions deep. If the queue is full, the IU stalls.
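The queue behaviour described above can be sketched as a toy cycle model. This is only an illustration under stated assumptions, not a description of the real hardware: it assumes the FPU drains one MMX instruction from the queue every 2 cycles (the databook figure, read as throughput) and that every integer instruction issues in exactly 1 cycle. The names `issue_cycles`, `QUEUE_DEPTH` and `FPU_CYCLES_PER_OP` are made up for this sketch.

```python
QUEUE_DEPTH = 6        # MMX queue depth, from the text above
FPU_CYCLES_PER_OP = 2  # ASSUMPTION: the FPU drains one MMX op every 2 cycles


def issue_cycles(stream):
    """Cycles the IU needs to issue every instruction in `stream`.

    `stream` is a string of 'M' (MMX) and 'I' (integer) instructions.
    ASSUMPTION: every integer instruction issues in exactly 1 cycle.
    """
    queued = 0  # MMX instructions currently held by the FPU queue
    timer = 0   # cycles the FPU has spent on the op at the queue head
    cycle = 0
    i = 0
    while i < len(stream):
        cycle += 1
        # FPU side: finish the op at the head of the queue every 2 cycles.
        if queued > 0:
            timer += 1
            if timer == FPU_CYCLES_PER_OP:
                queued -= 1
                timer = 0
        # IU side: issue one instruction, or stall on a full MMX queue.
        if stream[i] == 'I':
            i += 1
        elif queued < QUEUE_DEPTH:
            queued += 1
            i += 1
        # otherwise the queue is full and the IU stalls for this cycle
    return cycle


print(issue_cycles("M" * 24))            # back-to-back MMX
print(issue_cycles("MI" * 12))           # MMX and integer interleaved
print(issue_cycles("MMMMMMIIIIII" * 2))  # 6 MMX, then integer work
```

In this model both mixed streams issue at one instruction per cycle, while the MMX-only stream takes longer than its length because the IU ends up waiting on the full queue. Real timings may differ, since the FPU pipeline is undocumented.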
TBD: FPU queue, 3DNOW queue.
The FPU is an out-of-order execution unit: every instruction (only MMX was tested) takes 6 cycles to execute (latency), and the 2 cycles listed in the databook are throughputs of the undocumented FPU pipeline. When you move an MMX register holding the result of a calculation into an integer register, the FPU pipeline can stall the IALU.
In practice it means that:
- You have to consider that certain algorithms can be faster if you implement them with integer instructions instead of MMX.
- You have to interleave MMX and integer instructions to fully utilize the Geode. You can do this by scheduling up to 6 MMX instructions before the integer ones, or by simply interleaving them.
- You have to create at least 3 independent (orthogonal) dependency chains.
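The figure of 3 chains follows from the numbers above: with a 6 cycle latency and one MMX instruction started every 2 cycles, 6 / 2 = 3 independent results have to be in flight to keep the FPU busy. A rough sketch of that arithmetic, assuming the instructions are spread round-robin over the chains and that nothing else stalls (the function names are invented for this example):

```python
import math

LATENCY = 6         # cycles until an MMX result is available (measured)
ISSUE_INTERVAL = 2  # databook cycle count, read here as throughput


def chains_needed(latency=LATENCY, interval=ISSUE_INTERVAL):
    """Independent chains required to start one op every `interval` cycles."""
    return math.ceil(latency / interval)


def cycles_for(n_ops, n_chains, latency=LATENCY, interval=ISSUE_INTERVAL):
    """Cycle at which the last result is ready.

    ASSUMPTION: ops are assigned round-robin to the chains, and each op
    depends only on the previous op of its own chain.
    """
    ready = [0] * n_chains  # cycle at which each chain's latest result is ready
    next_issue = 0          # earliest cycle at which the FPU can start another op
    for op in range(n_ops):
        chain = op % n_chains
        start = max(next_issue, ready[chain])
        ready[chain] = start + latency
        next_issue = start + interval
    return max(ready)


print(chains_needed())
for chains in (1, 2, 3, 4):
    print(chains, cycles_for(12, chains))
```

In this model a fourth chain gains nothing over three, and a single chain is limited entirely by the latency.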
TBD: FP cycles, FP 2-cycle wait state, 3DNow! cycles, whether 3DNow! has the 2-cycle wait state, can it be more than 6 cycles?
TBDWGRM: FPU pipeline details
The IU has a branch predictor of unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a stall in the pipeline, which can probably be eliminated in certain cases; TBDWGRM, but not too important). An unpredicted branch carries a penalty of 5-8 cycles, depending on the pipeline state and current cosmic events (TBDWGRM).
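These costs can be combined into a rough per-branch estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes the 5-8 cycle figure is the total cost of an unpredicted branch (it may instead be a penalty on top of the base cost), and the function name and parameters are hypothetical.

```python
NOT_TAKEN_CYCLES = 1         # predicted, not taken (databook)
TAKEN_CYCLES = 2             # predicted, taken (measured)
UNPREDICTED_CYCLES = (5, 8)  # measured range; ASSUMPTION: used as total cost


def branch_cost(p_taken, p_unpredicted):
    """Return (low, high) expected cycles per branch.

    p_taken:       fraction of predicted branches that are taken
    p_unpredicted: fraction of branches the predictor gets wrong
    """
    predicted = p_taken * TAKEN_CYCLES + (1 - p_taken) * NOT_TAKEN_CYCLES
    return tuple(
        (1 - p_unpredicted) * predicted + p_unpredicted * cost
        for cost in UNPREDICTED_CYCLES
    )


# A loop branch that is almost always taken and rarely unpredicted.
low, high = branch_cost(p_taken=0.95, p_unpredicted=0.05)
print(low, high)
```

The estimate is only as good as the two fractions fed into it, and those have to come from profiling the actual code.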
TBD: indirect jump
TBD: memory subsystem(s), memory stalls
TBD: L1/L2 cache miss
TBD: memory speed
Notes
TBD: To Be Done
TBDWGRM: To Be Done When Get a Real Machine