Geode LX: Difference between revisions

From OLPC
Revision as of 18:13, 16 September 2007

This page is about the XO's CPU, an AMD Geode LX.

The cache and TLB information is useful for general high-level optimization. The rest is more useful for writing in assembly, writing compilers, profiling, and possibly debugging.

Instruction set

The Geode LX is very similar to the original Athlon. See the Geode instruction set page for details. You can use -march=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode. (On x86, gcc treats -mcpu as a deprecated synonym for -mtune, so it does not restrict the instruction set.) If your gcc does not support these, "-Os -march=pentiumpro -mtune=generic" would be a decent choice.
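To check which of these modes a given build actually used, a source file can test gcc's predefined macros. The sketch below assumes that gcc predefines `__geode__` for -march=geode and `__tune_geode__` for -mtune=geode; the function name `geode_build_mode` is made up for this example. Confirm the macro names on your compiler with `gcc -march=geode -dM -E - </dev/null` before relying on them.

```c
/* Reports how this translation unit was compiled.
 * Assumption: gcc predefines __geode__ for -march=geode and
 * __tune_geode__ for -mtune=geode. geode_build_mode is a
 * hypothetical helper name, not part of any library. */
const char *geode_build_mode(void)
{
#if defined(__geode__)
    return "geode-only";   /* may use Geode-specific instructions */
#elif defined(__tune_geode__)
    return "geode-tuned";  /* runs everywhere, scheduled for the Geode */
#else
    return "generic";      /* neither option was in effect */
#endif
}
```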

Cache and TLB

The instruction and data L1 caches are both 64 KB, 16-way set associative, with a 32-byte line size (page 242 of the LX databook). The data cache is write-back.

The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache.

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative.

Processing bulk data (images, video, audio, etc.) in chunks that fit in the 64 KB L1 data cache would thus be a decent optimization strategy. Remember that all data touched (both inputs and outputs) ought to fit within the 64 KB. For example, a 128x128 chunk of data is exactly 64 KB if it consists of 32-bit values such as sRGB pixels, floats, or pointers.
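The chunk-size arithmetic can be sketched as a small helper. This is only an illustration: the name `max_tile_side` and the idea of counting separate input and output buffers are assumptions made for the example, and it ignores associativity conflicts and whatever else competes for the cache.

```c
#include <stddef.h>

#define L1_DATA_BYTES (64u * 1024u)  /* L1 data cache size from the text */

/* Largest square tile side such that `buffers` tiles of
 * `bytes_per_elem`-sized elements together fit in L1_DATA_BYTES.
 * Returns 0 for invalid arguments. */
size_t max_tile_side(size_t bytes_per_elem, size_t buffers)
{
    if (bytes_per_elem == 0 || buffers == 0)
        return 0;
    size_t max_elems = L1_DATA_BYTES / (bytes_per_elem * buffers);
    size_t side = 0;
    /* Integer square root by counting up; max_elems is at most 65536. */
    while ((side + 1) * (side + 1) <= max_elems)
        side++;
    return side;
}
```

With 32-bit elements and a single buffer this gives the 128x128 chunk mentioned above. With one input and one output buffer of 32-bit elements, the side drops to 90.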

MSRs are provided for observing cache line age and TLB entry age.

Extra debug registers

In addition to the normal x86 debug registers, the Geode has a few extra. This allows for a fifth data breakpoint and an opcode breakpoint. The opcode breakpoint is particularly interesting, allowing a debugger to break in on the execution of a particular instruction. The ptrace() system call does not currently support these extra debug registers, and would need a patch. It is likely that one would want to patch gdb as well.

FPU

One should mask all exceptions and/or use imprecise exceptions to get full speed. Full speed can be ensured by writing a 1 to MSR 0x1a00, known as FP_MODE_MSR.
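On Linux, one way to write an MSR from user space is the msr driver, which exposes each CPU's registers as a device node and uses the file offset as the MSR number. The sketch below assumes that driver is loaded and that the process has permission to write the node; `write_msr` is a hypothetical helper name, and only the MSR number 0x1a00 and the value 1 come from the text above.

```c
#define _XOPEN_SOURCE 700  /* for pwrite() */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define FP_MODE_MSR 0x1a00u  /* MSR number from the text */

/* Writes a 64-bit value to MSR `msr` through an msr-style device node.
 * The Linux msr driver interprets the file offset as the MSR number.
 * Returns 0 on success, -1 on failure. */
int write_msr(const char *dev_path, uint32_t msr, uint64_t value)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

A call would look like `write_msr("/dev/cpu/0/msr", FP_MODE_MSR, 1)`, typically run as root. Whether the running kernel permits this write is not something this page has verified.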

Test results

The numbers shown here are calculated from test program runs:

Test1 Results

Test2 Results

Test3 Results

Architecture

The Geode contains an in-order pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction takes at least as many clock cycles as the LX databook lists for it (in Intel's terminology, the databook numbers are throughputs).

The IU can schedule at most 1 instruction per clock cycle. If it encounters an MMX instruction, it puts it into the FPU's MMX queue, which is 6 instructions deep. If the queue is full, the IU stalls.

TBD: FPU queue, 3DNOW queue.

TBD: FP cycles, FP 2-cycle waitstate, 3DNOW cycles, whether 3DNOW has the 2-cycle waitstate, can it be more than 6 cycles?

TBDWGRM: FPU pipeline details

The IU has a branch predictor of an unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a pipeline stall, which can probably be eliminated in certain cases; TBDWGRM, but not too important). An unpredicted branch costs a penalty of 5-8 cycles, depending on the pipeline state and current cosmic events (TBDWGRM).

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

MMX

The FPU behaves as an out-of-order execution unit: every instruction (only MMX was tested) takes 6 cycles to execute (latency), and the 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline. When you move an MMX register holding the result of a calculation into an integer register, the FPU pipeline can stall the IALU.

In practice this means that:

  1. Certain algorithms can be faster if you implement them with integer instructions instead of MMX.
  2. You have to interleave MMX and integer instructions to fully utilize the Geode. You can do this by scheduling up to 6 MMX instructions before the integer ones, or by simply alternating them.
  3. You have to create at least 3 independent (orthogonal) dependency chains.
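The figure of 3 chains follows from the numbers above: with a 6-cycle latency and one instruction issued every 2 cycles, 6 / 2 = 3 instructions must be in flight to keep the pipeline busy, and none of them may wait on another. The sketch below shows only that structure, in plain C with a made-up function name (`sum_three_chains`). A compiler will normally emit integer instructions for it, so real MMX code would apply the same pattern with intrinsics or assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums n values using three independent accumulators, so that no
 * addition depends on the result of the one issued just before it. */
uint32_t sum_three_chains(const uint32_t *v, size_t n)
{
    uint32_t a = 0, b = 0, c = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a += v[i];      /* chain 1 */
        b += v[i + 1];  /* chain 2 */
        c += v[i + 2];  /* chain 3 */
    }
    for (; i < n; i++)  /* leftover 0-2 elements */
        a += v[i];
    return a + b + c;   /* combine the chains once, at the end */
}
```

A single-accumulator loop computes the same result, but each addition would then have to wait for the previous one to finish.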

Notes

TBD: To Be Done

TBDWGRM: To Be Done When Get a Real Machine

See also