Geode LX

From OLPC
Revision as of 16:11, 2 September 2007

Geode optimization guide

This page is about the AMD Geode processor, which is the CPU of the XO. If you are not optimizing programs in assembly or writing a compiler, then only the cache and TLB information will be useful to you.

Cache and TLB

The instruction and data L1 caches are both 64 KB, 16-way set associative, with an undocumented line size. The data cache is write-back. The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache.

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative.

Processing bulk data (images, video, audio, etc.) in chunks of 64 KB is thus a decent optimization strategy, remembering that all data (both inputs and outputs) ought to fit within the 64 KB. For example, a 128x128 chunk of 32-bit values such as sRGB pixels, floats, or pointers is exactly 64 KB (128 x 128 x 4 bytes). Assuming standard 4 KB pages, the 16-entry L1 TLB also covers exactly 64 KB.

Test results

The numbers shown here are calculated from test program runs:

Test1 Results

Test2 Results

Architecture

The Geode contains an in-order pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction takes at least as many clock cycles as the LX databook lists for it (in Intel terminology, those numbers are throughputs).

The IU can issue at most 1 instruction per clock cycle. If it encounters an MMX instruction, it puts it into the FPU's MMX queue, which is 6 instructions deep. If the queue is full, the IU stalls.

TBD: FPU queue, 3DNOW queue.

The FPU is an out-of-order execution unit. Every instruction (only MMX was tested) takes 6 cycles to execute (latency), so the 2 cycles listed in the databook are throughputs of the undocumented FPU pipeline. When you read an MMX register (the result of a calculation) into an integer register, the FPU pipeline can stall the IALU.

In practice this means that:

  1. Consider that certain algorithms can be faster when implemented with integer instructions instead of MMX.
  2. Interleave MMX and integer instructions to fully utilize the Geode, either by scheduling up to 6 MMX instructions before the integer ones or by simply alternating them.
  3. Create at least 3 orthogonal (independent) dependency chains.

TBD: FP cycles, FP 2 cycle waitstate, 3DNOW cycles, 3DNOW whether has the 2 cycle waitstate, can it be more than 6 cycles?.

TBDWGRM: FPU pipeline details

The IU has a branch predictor of unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a pipeline stall, which can likely be eliminated in certain cases, TBDWGRM, but it is not too important). A mispredicted branch costs a 5-8 cycle penalty depending on the pipeline state and current cosmic events (TBDWGRM).

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

Notes

TBD: To Be Done

TBDWGRM: To Be Done When Get a Real Machine

See also