Geode LX

From OLPC
Revision as of 06:28, 19 August 2007 by NoiseEHC (talk | contribs)
Jump to: navigation, search

Geode optimization guide

This page is about the AMD Geode processor which is the CPU of the XO. If you are not optimizing programs in assembly then you can just stop reading here. If you are an AMD engineer then feel free to correct my mistakes if any.

TBD: hardware data, links

The numbers shown here are calculated from test program runs:

test1 results

test2 results

Architecture

The Geode contains an in-order pipelined integer unit (IU) with a tacked on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) are fully pipelined, so every instruction will take at least as many clock cycles as it is defined in the LX databook (so the numbers are throughputs in Intel slang).

The IU can schedule at most 1 instruction per clock cycle. If it encounters an MMX instuction then it puts it into the MMX queue of the FPU which is 6 instructions deep. If the queue is full then it stalls the IU.

TBD: FPU queue, 3DNOW queue.

The FPU is an out-of-order execution unit, because every instruction (only MMX tested) takes 6 cycles to execute (latency), the 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline. When you read an MMX register (result of a calculation) to an integer register the FPU pipeline can stall the IALU.

In practice it means that:

1. You have to think about that certain algorithms can be faster if you implement them with integer instructions instead of MMX.

2. You have to interleave MMX/integer instructions to fully utilize the Geode. You can do it by scheduling up to 6 MMX instruction before the integer ones or by just interleaving them.

3. You have to create at least 3 orthogonal dependency chains.

TBD: FP cycles, FP 2 cycle waitstate, 3DNOW cycles, 3DNOW whether has the 2 cycle waitstate, can it be more than 6 cycles?.

TBDWGRM: FPU pipeline details

The IU has a branch predictor of an unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a stall in the pipeline, likely can be eliminated in certain cases, TBDWGRM but not too important). An unpredicted branch costs 5-8 cycles penalty depending on the pipeline state and current cosmic events (TBDWGRM).

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

Notes

TBD: To Be Done

TBDWGRM: To Be Done Where Get a Real Machine