Geode LX

Latest revision as of 13:35, 14 November 2010

AMD Geode LX is the CPU of the XO laptop. From the May 2007 B3 test boards onwards, an XO has an AMD Geode™ LX 700@0.8W processor running at 433 MHz.

On this page, the cache and TLB information is useful for general high-level optimization; the rest is more useful for writing assembly, writing compilers, profiling, and possibly debugging.

Architecture

The Geode contains an in-order, pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction takes at least as many clock cycles as the LX databook lists for it (in Intel terminology, the databook numbers are throughputs). The numbers listed for FP/MMX/3DNow! instructions may in fact be clock counts for some unknown execution stage of the undocumented FPU pipeline.

The IU can schedule at most 1 instruction per clock cycle. When it encounters an MMX instruction, it puts the instruction into the FPU's MMX queue, which is 6 instructions deep (page 656). If the queue is full, the IU stalls. FP instructions work the same way, with a queue length of 4 (page 656). No test has been able to demonstrate the 2-clock delay mentioned on page 656.

TBD: 3DNOW queue.

TBDWGRM: FPU pipeline details

The IU has a branch predictor of unspecified size. A predicted branch that is not taken costs 1 cycle (as in the databook). A predicted branch that is taken costs 2 cycles (most likely because of a pipeline stall, which can probably be eliminated in certain cases; TBDWGRM, but not too important). An unpredicted branch costs a 5-8 cycle penalty depending on the pipeline state and current cosmic events (TBDWGRM). (test15-19)
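As a rough illustration of what these branch costs add up to, here is a small sketch in C. The function name and its parameters are hypothetical, and it assumes the 5-8 cycle figure is the total cost of an unpredicted branch, which the measurements above do not settle.

```c
/* Cycle costs quoted in the text above. */
enum {
    COST_PREDICTED_NOT_TAKEN = 1,
    COST_PREDICTED_TAKEN     = 2,
    COST_UNPREDICTED_MIN     = 5,
    COST_UNPREDICTED_MAX     = 8
};

/*
 * Hypothetical helper: estimate the cycles spent on one branch instruction
 * that executes `total` times. `taken` counts predicted-and-taken
 * executions, `unpredicted` counts unpredicted ones, and the remainder is
 * treated as predicted-and-not-taken. `unpredicted_cost` is expected to be
 * somewhere in the 5..8 range.
 */
unsigned long branch_cycles(unsigned long total,
                            unsigned long taken,
                            unsigned long unpredicted,
                            unsigned unpredicted_cost)
{
    unsigned long not_taken = total - taken - unpredicted;

    return not_taken * COST_PREDICTED_NOT_TAKEN
         + taken * COST_PREDICTED_TAKEN
         + unpredicted * unpredicted_cost;
}
```

For a branch executed 1000 times, with 900 predicted-taken and 50 unpredicted executions, this model gives between 2100 and 2250 cycles depending on which end of the 5-8 range applies.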

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

Instruction set

The instruction set of the Geode LX is a combination of Intel Pentium, AMD Athlon and AMD Geode LX processor specific instructions. Specifically, it supports the Pentium, Pentium Pro, AMD 3DNow! technology and MMX instructions for the AMD Athlon processor. See the Geode instruction set page for details.

You can use -march=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode. (On x86, gcc treats -mcpu as a deprecated synonym for -mtune, so -mcpu=geode only tunes.) If your gcc does not support this, "-Os -march=pentiumpro -mtune=generic" would be a decent choice.
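To check from inside a program which of these options a build actually used, the compiler's predefined macros can be inspected. The sketch below assumes that gcc predefines __geode__ when Geode-specific code generation is selected and __tune_geode__ when the Geode is only the tuning target; the function name is made up for illustration.

```c
/*
 * Hypothetical helper: report which Geode-related target was selected at
 * compile time. The macro names are assumed to be the ones gcc predefines
 * for this CPU; on other compilers the generic branch is taken.
 */
const char *geode_build_target(void)
{
#if defined(__geode__)
    return "Geode-specific code";
#elif defined(__tune_geode__)
    return "generic code tuned for Geode";
#else
    return "no Geode-specific options";
#endif
}
```

Running "gcc -dM -E - < /dev/null" with the options in question is another way to see which macros a given gcc version defines.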

Memory, Cache and TLB

The instruction and data L1 caches are both 64 KB, 16-way set associative, with 32 byte line size (page 242). Data is write-back. The L1 cache miss latency is ~10-12 clocks (*).

The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache. The measured line size is 32 bytes (*), the L2 cache miss latency is ~28-35 clocks (*).

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative. Miss latency ~50+ clocks (*).
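The geometry implied by these figures can be worked out with a few lines of C. This is only a sketch: the helper names are invented, the set index is assumed to be the usual "line number modulo number of sets" (the databook is not quoted here on that point), and the 4 KB page size is an assumption.

```c
#include <stddef.h>

/* Number of sets = capacity / (ways * line size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* Set that an address maps to, assuming simple modulo indexing. */
size_t cache_set_index(size_t address, size_t line_bytes, size_t sets)
{
    return (address / line_bytes) % sets;
}

/* Amount of memory a TLB can map at once for a given page size. */
size_t tlb_reach(size_t entries, size_t page_bytes)
{
    return entries * page_bytes;
}
```

With the numbers above, cache_sets(64 * 1024, 16, 32) gives 128 sets for each L1 cache and cache_sets(128 * 1024, 4, 32) gives 1024 sets for the L2 (using the measured 32-byte line). Under the modulo assumption, addresses 4096 bytes apart land in the same L1 set. With 4 KB pages, the 16-entry L1 TLB covers 64 KB and the 64-entry L2 TLB covers 256 KB.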

MSRs are provided for observing cache line age and TLB entry age.

The LX can have only 3 prefetches outstanding at once. The geode-perf Memcopy Report (http://dev.laptop.org/git?p=geode-perf;a=blob_plain;f=doc/Memcopy%20Report.htm;hb=HEAD) puts it this way: "This reflects the internal limit of the LX of 3 outstanding prefetch transactions".

(*) Memory subsystem data was measured with The Calibrator. Download calibrator.c (http://homepages.cwi.nl/~manegold/Calibrator/src/calibrator.c) from http://homepages.cwi.nl/~manegold/Calibrator/. Replace all "round" with "round2" in the source file. Compile with "gcc calibrator.c -o calibrator -lm", then run with "./calibrator 433 40M something". The L2 cache is reported as 192 KB (64 KB L1 + 128 KB L2, since it is a victim cache).

The DDR memory in the XO (DDR333, 166 MHz) has an achievable read speed of 600 MB/s under optimal conditions: sequential access, 8-byte reads (MMX), and prefetching 64 bytes ahead (2x the cache line size). This means that on average the Geode LX can read a new cache line every ~20+ cycles. See test5.
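The "~20+ cycles per cache line" figure follows from the bandwidth, and the access pattern described above can be sketched in portable C. Both function names are hypothetical, MB is taken as 10^6 bytes, and __builtin_prefetch is a GCC builtin rather than anything Geode-specific; whether the compiler emits a prefetch instruction the LX handles well is not verified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Average CPU cycles per cache line at a given sustained read bandwidth. */
double cycles_per_line(double cpu_hz, double bytes_per_sec, double line_bytes)
{
    return cpu_hz / (bytes_per_sec / line_bytes);
}

/*
 * Sum a buffer with sequential 8-byte reads, hinting the data 64 bytes
 * (two 32-byte lines) ahead. One hint is issued per 32-byte line, so only
 * a few prefetches should be in flight at any time.
 */
uint64_t sum_with_prefetch(const uint64_t *buf, size_t count)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
#if defined(__GNUC__)
        /* 4 words = 32 bytes = 1 line; 8 words = 64 bytes ahead. */
        if ((i & 3) == 0 && i + 8 < count)
            __builtin_prefetch(&buf[i + 8]);
#endif
        sum += buf[i];
    }
    return sum;
}
```

cycles_per_line(433e6, 600e6, 32) comes out at roughly 23 cycles, which agrees with the estimate in the text.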

Extra debug registers

In addition to the normal x86 debug registers, the Geode has a few extra. This allows for a fifth data breakpoint and an opcode breakpoint. The opcode breakpoint is particularly interesting, allowing a debugger to break in on the execution of a particular instruction. The ptrace() system call does not currently support these extra debug registers, and would need a patch. It is likely that one would want to patch gdb as well.

FPU

The FPU is an out-of-order execution unit which processes MMX and 3DNow! instructions as well. The information about the FPU in the databook is not only sparse but in places misleading and sometimes totally wrong! Out-of-order here means that a single precision FP instruction (FADD for example) takes 7 cycles to execute (listed as 1 cycle in the databook), an MMX instruction (PADDB for example) takes 6 cycles to execute (listed as 2 cycles in the databook), and a 3DNow! instruction (PFADD for example) takes 8 cycles to execute (listed as 2 cycles in the databook). These full instruction lengths (latencies) are not listed in the databook; they were measured in test403b-e and test405c-d. The 1 and 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline, which means that if you run your program on a big enough buffer (so that consecutive elements can be processed in parallel) you can get close to the listed cycle counts.

Because the FPU runs asynchronously with the IU, every data exchange requires synchronization, which can consume a LOT of cycles (this is not documented anywhere; it was measured):
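The gap between latency and throughput can be put into numbers with an idealized model. The two helpers below are hypothetical and ignore the queue depths and synchronization stalls; they only show why a long run of independent operations approaches the databook figures while a dependent chain does not.

```c
/* n operations where each one needs the previous result. */
unsigned long dependent_chain_cycles(unsigned long n, unsigned latency)
{
    return n * latency;
}

/*
 * n independent operations issued back to back: the first result appears
 * after `latency` cycles, every further one `throughput` cycles later.
 */
unsigned long pipelined_cycles(unsigned long n, unsigned latency,
                               unsigned throughput)
{
    return n == 0 ? 0 : latency + (n - 1) * throughput;
}
```

With the measured PADDB figures (latency 6, throughput 2), 100 dependent operations take about 600 cycles in this model, while 100 independent ones take about 204.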

  1. If the IU stores to memory and an FP/MMX/3DNow! instruction later loads that data, the FPU instruction must be scheduled 1, 2 or 4 cycles after the IU store (that is, with 0, 1 or 3 nops between them; test403a-405d). 3 nops induce a 1-clock penalty for the IU; 5 or more nops leave the FPU idle.
  2. If the FPU loads from memory, the instruction that uses the target register (mm0-7 or ST(0-7)) must be scheduled 2 cycles after the FPU load (with 1 nop between them; test403a-405d). 0 nops induce a 1-clock penalty for the IU; 3 or more nops leave the FPU idle.
  3. If the FPU stores to memory and the IU then loads that data, the load must be scheduled 8 cycles after an MMX/3DNow! store (movq), 14 cycles after an FP store (fstps), or 17 cycles after fistpl (7, 13 and 16 nops respectively). A shorter delay stalls the IU, and a delay of exactly 7, 13 or 16 cycles (6, 12 or 15 nops) costs the IU an additional 1-clock penalty. Measured in test403a-405d.
  4. If the FPU loads from IU registers (movd, pinsrw) or stores to IU registers (movd, pextrw), the synchronization stalls the IU, and whatever number the databook lists, the actual cycle count is much higher. (See the synchronized ops in the MMX section.)
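The store-to-load distances from item 3 can be kept in a small lookup when hand-scheduling code. The enum and function names below are invented for illustration; the cycle counts are the measured values quoted in the list, and each filler instruction is assumed to occupy the IU for exactly one cycle.

```c
/* Kinds of FPU store covered by item 3 above. */
enum fpu_store { STORE_MOVQ, STORE_FSTPS, STORE_FISTPL };

/* Minimum distance, in cycles, between the FPU store and the IU load. */
unsigned min_load_delay_cycles(enum fpu_store kind)
{
    switch (kind) {
    case STORE_MOVQ:   return 8;   /* MMX/3DNow! store */
    case STORE_FSTPS:  return 14;  /* FP store */
    case STORE_FISTPL: return 17;  /* FP store with integer conversion */
    }
    return 0;
}

/* One-cycle instructions that have to sit between the store and the load. */
unsigned filler_instructions(enum fpu_store kind)
{
    return min_load_delay_cycles(kind) - 1;
}
```

In real code the filler slots would be occupied by useful integer work rather than nops.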

Note that these cycle counts and delays do not mean that I know the FPU pipeline, or that an instruction really takes that many cycles to execute; they only describe what it looks like from the perspective of the IU (and that is what is needed for assembly optimization).

Note also that the FPU is a single precision unit and double precision calculations take a lot more cycles, so if speed is important you should set single precision at the beginning of your program.

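 void set_single_fpu_precision(void)
 {
   unsigned short temp = 0;
   // Clear the precision-control bits (8-9) of the x87 control word,
   // which selects single (24-bit) precision instead of double/extended.
   __asm__ __volatile__(
   "fstcw %[temp] \n\t"           // store the control word to memory
   "andw $0xfcff,%[temp] \n\t"    // PC = 00b: single precision
   "fldcw %[temp] \n\t"           // load the modified control word
   : [temp] "+m" (temp)           // temp is both written and read by the asm
   :
   : "cc", "memory"
   );
 }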

MMX

Some MMX operations are special because they use the 32-bit IU registers directly. I call these synchronized operations (the name is my own invention). They are either Synchronized Loads (SL) or Synchronized Stores (SS).

Synchronized Loads (SL)

These load a 32-bit register into an MMX register. It seems that these ops can be scheduled only if the following two conditions are met; otherwise they stall the IU (test 301a-j):

  1. No dependent ops are running (that is, no FPU instruction in flight has the SL's target register as its source or target operand).
  2. No other SL ops are running (no matter which registers they use).

MOVD takes 5 cycles to schedule and 0 to execute (so the IU is busy for 5 cycles and does not seem to block any pending FPU instructions).

PINSRW takes 5 cycles to schedule and 4 to execute (so the IU is busy for 5 cycles and after that it steals 4 cycles from the executing FPU instructions). The IU can schedule non-dependent normal MMX ops after these ops.

Synchronized Stores (SS)

These store part of an MMX register into a 32-bit register. They stall the IU until their source values have been calculated in the MMX queue, and once the source is accessible they take 10 (ten!) cycles to execute. MOVD and PEXTRW have been tested; PMOVMSKB has not. It seems that the MMX queue keeps running while such an op is executing.

TBD: test PMOVMSKB? or is it useless enough?

In practice this means that:

  1. Avoid using MMX results in integer ops. In other words, if you are tempted to use MMX because there are too few integer registers, or because some fancy MMX op does something useful, forget it!
  2. Use MMX only on streaming data (a large enough buffer). If you really need the results in 32-bit registers, process a large enough buffer first and then, in a second pass, load the results from memory into 32-bit registers and use them there.
  3. Keep in mind that certain algorithms can be faster when implemented with integer instructions instead of MMX.
  4. Interleave MMX and integer instructions to fully utilize the Geode, either by scheduling up to 6 MMX instructions ahead of the integer ones or by simply alternating them.
  5. Create at least 3 orthogonal (independent) dependency chains.
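The last point in the list above, written out in plain C: a reduction split across three accumulators, none of which depends on the others. This is only a sketch with a made-up function name. In hand-written MMX code each accumulator would live in its own MMX register, and nothing here guarantees that a compiler produces MMX instructions from it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum an array using three independent dependency chains, so that an
 * addition in one chain does not have to wait for the result of the
 * previous addition in another.
 */
uint32_t sum_three_chains(const uint32_t *data, size_t count)
{
    uint32_t a = 0, b = 0, c = 0;
    size_t i = 0;

    /* Main loop: three elements per iteration, one per chain. */
    for (; i + 3 <= count; i += 3) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
    }

    /* Leftover elements (at most two). */
    for (; i < count; i++)
        a += data[i];

    /* The chains are combined only once, at the very end. */
    return a + b + c;
}
```

Because unsigned addition wraps and is associative, the result is the same as for a single-accumulator loop.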

Test results

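The numbers shown here are calculated from test program runs:

  • Test1 (geodetest_cputest1.zip), results (geodetest_results1.zip)
  • Test2 (geodetest_cputest2.zip), results (geodetest_results2.zip)
  • Test3 (geodetest_cputest3.zip), results (geodetest_results3.zip)
  • Test4 (geodetest_cputest4.zip), with results in the comments
  • Test5 (geodetest_cputest5.zip), with results in the comments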

Notes

  • TBD: To Be Done
  • TBDWGRM: To Be Done When Get a Real Machine

References

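  • AMD Geode™ LX Processors Data Book revG. AMD, 2008-05. http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33234G_LX_databook.pdf (possibly a broken link; a possibly newer version is at http://support.amd.com/us/Embedded_TechDocs/33234H_LX_databook.pdf)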

See also

  • Geode instruction set
  • Geode optimization effort

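External links

  • AMD Geode™ LX Processor Family. AMD. http://www.amd.com/us-en/ConnectivitySolutions/ProductInformation/0,,50_2330_9863_13022%5E13057,00.html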