Geode LX: Difference between revisions

From OLPC
No edit summary
 
(15 intermediate revisions by 3 users not shown)
Line 1: Line 1:
AMD '''Geode LX''' is the CPU of a XO laptop. At this time, an XO has an AMD Geode™ LX 700@0.8W processor at 433MHz.
AMD '''Geode LX''' is the CPU of a XO laptop. From the May 2007 B3 test boards onwards, an [[XO]] has an AMD Geode™ LX 700@0.8W processor at 433MHz.


In this page, the cache and TLB information is useful for general high-level optimization and the rest are more useful for writing in assembly, writing compilers, profiling, and possibly debugging.
In this page, the cache and TLB information is useful for general high-level optimization and the rest are more useful for writing in assembly, writing compilers, profiling, and possibly debugging.
Line 5: Line 5:
== Architecture ==
== Architecture ==


The Geode contains an in-order pipelined integer unit (IU) with a tacked on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) are fully pipelined, so every instruction will take at least as many clock cycles as it is defined in the LX databook (so the numbers are throughputs in Intel slang).
The Geode contains an in-order pipelined integer unit (IU) with a tacked on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) are fully pipelined, so every instruction will take at least as many clock cycles as it is defined in the LX databook (so the numbers are throughputs in Intel slang). Actually the numbers listed for FP/MMX/3DNow! instructions can be clock counts in some unknown execution stage of the undocumented FPU pipeline.


The '''IU''' can schedule at most 1 instruction per clock cycle. If it encounters an MMX instuction then it puts it into the MMX queue of the FPU which is 6 instructions deep. If the queue is full then it stalls the IU.
The '''IU''' can schedule at most 1 instruction per clock cycle. If it encounters an MMX instuction then it puts it into the MMX queue of the FPU which is 6 instructions deep (page 656). If the queue is full then it stalls the IU. FP instructions work the same with a queue lengh of 4 (page 656). There is no test which could prove the existence of the 2 clock delay mentioned on page 656.


TBD: FPU queue, 3DNOW queue.
TBD: 3DNOW queue.

TBD: FP cycles, FP 2 cycle waitstate, 3DNOW cycles, 3DNOW whether has the 2 cycle waitstate, can it be more than 6 cycles?.


TBDWGRM: FPU pipeline details
TBDWGRM: FPU pipeline details


The IU has a '''branch predictor''' of an unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a stall in the pipeline, likely can be eliminated in certain cases, TBDWGRM but not too important). An unpredicted branch costs 5-8 cycles penalty depending on the pipeline state and current cosmic events (TBDWGRM).
The IU has a '''branch predictor''' of an unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a stall in the pipeline, likely can be eliminated in certain cases, TBDWGRM but not too important). An unpredicted branch costs 5-8 cycles penalty depending on the pipeline state and current cosmic events (TBDWGRM). (test15-19)


TBD: indirect jump
TBD: indirect jump
Line 30: Line 28:
You can use -mcpu=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode. If your gcc does not support this, "-Os -mcpu=pentiumpro -mtune=generic" would be a decent choice.
You can use -mcpu=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode. If your gcc does not support this, "-Os -mcpu=pentiumpro -mtune=generic" would be a decent choice.


== Cache and TLB ==
== Memory, Cache and TLB ==
The instruction and data L1 caches are both 64 KB, 16-way set associative, with 32 byte line size (page 242). Data is write-back.


The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache.
The instruction and data L1 caches are both 64 KB, 16-way set associative, with 32 byte line size (page 242). Data is write-back. The L1 cache miss latency is ~10-12 clocks (*).


The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache. The measured line size is 32 bytes (*), the L2 cache miss latency is ~28-35 clocks (*).
The instruction and data L1 TLBs are both 16-entry, fully associative.
The L2 TLB is 64-entry, 2-way set associative.


The instruction and data L1 TLBs are both 16-entry, fully associative.
Processing bulk data (images, video, audio, etc.) in chunks of 64 KB would thus be a decent optimization strategy, remembering that all data (both inputs and outputs) ought to fit within the 64 KB. For example, a 128x128 chunk of data is 64 KB if it consists of 32-bit values such as sRGB pixels, floats, or pointers.
The L2 TLB is 64-entry, 2-way set associative. Miss latency ~50+ clocks (*).


MSRs are provided for observing cache line age and TLB entry age.
MSRs are provided for observing cache line age and TLB entry age.


The LX has the limitation that it can only execute 3 concurrent prefetches (see [http://dev.laptop.org/git?p=geode-perf;a=blob_plain;f=doc/Memcopy%20Report.htm;hb=HEAD "This reflects the internal limit of the LX of 3 outstanding prefetch transactions"]).
The LX has the limitation that it can only execute 3 concurrent prefetches (see [http://dev.laptop.org/git?p=geode-perf;a=blob_plain;f=doc/Memcopy%20Report.htm;hb=HEAD "This reflects the internal limit of the LX of 3 outstanding prefetch transactions"]).

(*)
Memory subsystem data was measured with The Calibrator.
Download [http://homepages.cwi.nl/~manegold/Calibrator/src/calibrator.c] from [http://homepages.cwi.nl/~manegold/Calibrator/]. Replace all "round" with "round2" in the source file. Compile with "gcc calibrator.c -o calibrator -lm" then run with "./calibrator 433 40M something". The L2 cache is reported as 192KB (64K L1 + 128K L2 since it is a victim cache).

The DDR memory as in the XO (DDR333-166Mhz) has an achievable read speed of 600MB/sec in optimal conditions: sequential access, 8 byte reads (MMX), prefech 64 bytes ahead (2x cache line size). It means that on average the Geode LX can read a new cache line every ~20+ cycle. See test5.


== Extra debug registers ==
== Extra debug registers ==
Line 48: Line 51:


== FPU ==
== FPU ==
One should mask all exceptions and/or use imprecise exceptions to get full speed. Full speed can be ensured by writing a 1 to MSR 0x1a00, known as FP_MODE_MSR.


The FPU is an out-of-order execution unit which processes MMX and 3DNow instructions as well. The information about the FPU in the databook is not only sparse but somewhere misleading and sometimes totally wrong!!! Out-of-order here means that a single precision FP instruction (FADD for example) takes 7 cycles to execute (listed as 1 cycle in the databook), an MMX instruction (PADDB for example) takes 6 cycles to execute (listed as 2 cycles in the databook) and a 3DNow instruction (PFADD for example) takes 8 cycles to execute (listed as 2 cycles in the databook). These full instruction lengths (latency) are not listed in the databook, but measured in test403b-e and test405c-d. The 1 and 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline, which means that if you run your program on a big enough buffer (so consecutive elements can be processed in parallel) then you can achieve almost the listed cycle counts.
== Test results ==
The numbers shown here are calculated from test program runs:


Because the FPU runs asynchronously with the IU, every data exchange requires synchronization, which can consume a LOT of cycles (this is nowhere documented but measured):
[[media:geodetest_cputest1.zip|Test1]] [[media:geodetest_results1.zip|Results]]
#If the IU stores to memory and it is later used for an FP/MMX/3DNow! load then this FPU instruction must be scheduled 1, 2 or 4 cycles after the IU store (there must be 0, 1 or 3 nops between them, test403a-405d). 3 nops will induce 1 clock penalty for the IU, 5+ nops will idle the FPU.
#If the FPU loads from memory then the instruction which uses the target register (mm0-7 or ST(0-7)) must be scheduled 2 cycles after the FPU load (there must be 1 nop between them, test403a-405d). 0 nop will induce 1 clock penalty for the IU, 3+ nops will idle the FPU.
#If the FPU stores to memory and it is loaded by the IU then this load instruction must be scheduled 8 cycles after the MMX/3DNow! store (movq) and 14 cycles after an FP store (fstps). In case of fistpl it means 17 cycles (7, 13 and 16 nops respectively). Delaying the load for fewer cycles will stall the IU, and if the delay is exactly 7, 13 and 16 cycles (6, 12 and 15 nops) then the IU will take an additional 1 clock penalty. It is measured in test403a-405d.
#If the FPU loads from IU registers (movd, pinsrw) or stores to IU registers (movd, pextrw) then this synchronization results in the stalling of the IU and no matter what number is listed in the databook, the actual running cycle count will be much higher!!! (See synchronized ops.)


Note that these cycle counts and delays does not mean that I know the FPU pipeline and it really takes that many cycles to execute an instruction, it just means that from the perspective of the IU it looks like that (and that is what is needed for assembly optimization).
[[media:geodetest_cputest2.zip|Test2]] [[media:geodetest_results2.zip|Results]]


Note also that the FPU is a single precision unit and calculating for double precision takes a lot more cycles, so if speed is important you should set single precision at the beginning of your program.
[[media:geodetest_cputest3.zip|Test3]] [[media:geodetest_results3.zip|Results]]

void set_single_fpu_precision()
{
short temp = 0;
// disable double precision
__asm__ __volatile__(
"fstcw %[temp] \n\t"
"andw $0xfcff,%[temp] \n\t"
"fldcw %[temp] \n\t"
:
: [temp] "o" (temp)
: "memory"
);
}


== MMX ==
== MMX ==


The MMX operations are special because there are synchronized ones which use 32 bit IU registers directly (I invented this name for them). They are either Synchronized Loads (SL) or Synchronized Stores (SS).
The '''FPU''' is an out-of-order execution unit which processes MMX instructions as well. The information about the FPU in the databook is not only sparse but somewhere misleading and sometimes totally wrong!!! Out-of-order here means that for example PADDB takes 6 cycles to execute (not listed in the databook, latency) and the 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline. Since the FPU runs asynchronously with the IU, it means that one can schedule MMX ops in every cycle until the MMX queue becomes full.


Synchronized Loads (SL)
There are two types of MMX instructions:


These load an 32 bit reg into an MMX reg. It seems that these ops can only be scheduled only if the following two conditions are met, otherwise they stall the IU (test 301a-j):
1. Normal ops. These use MMX regs or memory operands.
#When there are no dependent ops running (it means that there must not be running any FPU instruction whose source or target operand is the SL's target register).
#There are no other SL ops running (does not matter what registers would they use)


MOVD takes 5 cycles to schedule and 0 to execute (so the IU executes for 5 cycles and seems not to block any pending FPU instructions).
TBD: test real clock cycles and memory access


PINSRW takes 5 cycles to schedule and 4 to execute (so the IU executes for 5 cycles and after that it steals 4 cycles from executing FPU instructions). The IU can schedule not dependent normal MMX ops after these ops.
2. Synchronized ops (I invented this name for them). These use 32 bit regs.


2/a. Synchronized Loads (SL). These load an 32 bit reg into an MMX reg.
Synchronized Stores (SS).


These store part of an MMX reg into an 32 bit reg. These ops stall the IU until their source values are calculated in the MMX queue and after the source is accessible they take 10 (TEN!!!) cycles to execute. MOVD and PEXTRW are tested, PMOVMSKB is not tested. It seems that the MMX queue keeps running while this op is executing.
It seems that these ops can only be scheduled when there are no dependent ops running and there are no other SL ops running, otherwise they stall the IU. MOVD takes 5 cycles to schedule and 0 to execute. PINSRW takes 5 cycles to schedule and 4 to execute. If the op cannot be scheduled then it stalls the IU (see test 301a-j). The IU can schedule not dependent normal MMX ops after these ops.

2/b. Synchronized Stores (SS). These store part of an MMX reg into an 32 bit reg.

These ops stall the IU until their source values are calculated in the MMX queue and after the source is accessible they take 10 (TEN!!!) cycles to execute. MOVD and PEXTRW are tested, PMOVMSKB is not tested. It seems that the MMX queue keeps running while this op is executing.


TBD: test PMOVMSKB? or is it useless enough?
TBD: test PMOVMSKB? or is it useless enough?
Line 88: Line 105:
#You have to create at least 3 orthogonal dependency chains.
#You have to create at least 3 orthogonal dependency chains.


== Notes ==
== Test results ==
The numbers shown here are calculated from test program runs:
* [[media:geodetest_cputest1.zip|Test1]] [[media:geodetest_results1.zip|Results]]
* [[media:geodetest_cputest2.zip|Test2]] [[media:geodetest_results2.zip|Results]]
* [[media:geodetest_cputest3.zip|Test3]] [[media:geodetest_results3.zip|Results]]
* [[media:geodetest_cputest4.zip|Test4 with results in the comments]]
* [[media:geodetest_cputest5.zip|Test5 with results in the comments]]


== Notes ==
TBD: To Be Done


TBDWGRM: To Be Done When Get a Real Machine
* TBD: To Be Done
* TBDWGRM: To Be Done When Get a Real Machine


== References ==
== References ==
* [http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33234F_LX_databook.pdf AMD Geode™ LX Processors Data Book]. AMD, 2007-05-30.
* [http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33234G_LX_databook.pdf AMD Geode™ LX Processors Data Book revG]. AMD, 2008-05
(broken link? possible newer version [http://support.amd.com/us/Embedded_TechDocs/33234H_LX_databook.pdf here])


== See also ==
== See also ==


* [[Geode instruction set]]
* [[Geode instruction set]]
* [[Geode Optimization Effort]]
* [[Geode optimization effort]]


== External links ==
== External links ==

Latest revision as of 13:35, 14 November 2010

AMD Geode LX is the CPU of an XO laptop. From the May 2007 B3 test boards onwards, an XO has an AMD Geode™ LX 700@0.8W processor running at 433 MHz.

On this page, the cache and TLB information is useful for general high-level optimization; the rest is more useful for writing assembly, writing compilers, profiling, and possibly debugging.

Architecture

The Geode contains an in-order pipelined integer unit (IU) with a tacked-on FPU. Neither the integer ALU (IALU) nor the FPU ALU (FALU) is fully pipelined, so every instruction takes at least as many clock cycles as the LX databook lists for it (in Intel terminology, the listed numbers are throughputs). The numbers listed for FP/MMX/3DNow! instructions may in fact be clock counts for some unknown execution stage of the undocumented FPU pipeline.

The IU can schedule at most 1 instruction per clock cycle. If it encounters an MMX instruction, it puts it into the MMX queue of the FPU, which is 6 instructions deep (page 656). If the queue is full, the IU stalls. FP instructions work the same way with a queue length of 4 (page 656). No test has been able to confirm the existence of the 2-clock delay mentioned on page 656.

TBD: 3DNOW queue.

TBDWGRM: FPU pipeline details

The IU has a branch predictor of an unspecified size. Not taking a predicted branch costs 1 cycle (as in the databook). Taking a predicted branch costs 2 cycles (most likely it causes a pipeline stall, which can probably be eliminated in certain cases; TBDWGRM, but not too important). An unpredicted branch costs a penalty of 5-8 cycles, depending on the pipeline state and current cosmic events (TBDWGRM). (test15-19)
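As a rough illustration of how these figures combine, the sketch below estimates the branch overhead of a piece of code from its branch counts. The struct and function names are hypothetical, and treating the 5-8 cycle figure as the full cost of an unpredicted branch is an assumption: the text does not say whether it comes on top of the base cost.

```c
#include <assert.h>

/* Hypothetical helper: estimated branch cycles, using the figures above. */
struct branch_cost {
    unsigned long min_cycles;
    unsigned long max_cycles;
};

struct branch_cost estimate_branch_cycles(unsigned long predicted_not_taken,
                                          unsigned long predicted_taken,
                                          unsigned long unpredicted)
{
    struct branch_cost c;
    /* 1 cycle per predicted not-taken branch, 2 per predicted taken branch */
    unsigned long base = predicted_not_taken * 1UL + predicted_taken * 2UL;

    c.min_cycles = base + unpredicted * 5UL;  /* assumed best case */
    c.max_cycles = base + unpredicted * 8UL;  /* assumed worst case */
    assert(c.min_cycles <= c.max_cycles);
    return c;
}
```

For a loop whose backward branch is taken and predicted 999 times and goes wrong once on exit, estimate_branch_cycles(0, 999, 1) gives a range of 2003 to 2006 cycles under these assumptions.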

TBD: indirect jump

TBD: memory subsystem(s), memory stalls

TBD: L1/L2 cache miss

TBD: memory speed

Instruction set

The instruction set of the Geode LX is a combination of Intel Pentium, AMD Athlon and AMD Geode LX processor specific instructions. Specifically, it supports the Pentium, Pentium Pro, AMD 3DNow! technology and MMX instructions for the AMD Athlon processor. See the Geode instruction set page for details.

You can use -march=geode to make gcc produce Geode-specific code, or -mtune=geode to make gcc produce code that runs everywhere but is optimized for the Geode (on x86, gcc treats -mcpu as a deprecated synonym for -mtune, so it does not select the instruction set). If your gcc does not support this, "-Os -march=pentiumpro -mtune=generic" would be a decent choice.

Memory, Cache and TLB

The instruction and data L1 caches are both 64 KB, 16-way set associative, with 32 byte line size (page 242). Data is write-back. The L1 cache miss latency is ~10-12 clocks (*).

The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache. The measured line size is 32 bytes (*), and the L2 cache miss latency is ~28-35 clocks (*).

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative. The TLB miss latency is ~50+ clocks (*).
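One way to use these cache sizes is to process bulk data in chunks small enough to stay in the 64 KB L1 data cache, so that a second pass over a chunk hits in cache. The sketch below is illustrative only: the function name, the choice of half the L1 size as chunk size, and the two toy passes are assumptions, not something measured on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1_DATA_BYTES (64u * 1024u)        /* L1 data cache size from the text */
#define CHUNK_BYTES   (L1_DATA_BYTES / 2)  /* assumed: leave half for other data */

/* Apply two passes (double each byte, then add a bias, both modulo 256)
   chunk by chunk, so the second pass finds its chunk still in L1. */
void scale_and_bias_chunked(uint8_t *buf, size_t len, uint8_t bias)
{
    assert(buf != NULL || len == 0);

    for (size_t start = 0; start < len; start += CHUNK_BYTES) {
        size_t end = start + CHUNK_BYTES;
        if (end > len)
            end = len;

        for (size_t i = start; i < end; i++)    /* pass 1 */
            buf[i] = (uint8_t)(buf[i] * 2u);
        for (size_t i = start; i < end; i++)    /* pass 2 */
            buf[i] = (uint8_t)(buf[i] + bias);
    }
}
```

The result is the same as running each pass over the whole buffer; only the order of memory accesses changes. A suitable chunk size for real code would have to be found by measurement.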

MSRs are provided for observing cache line age and TLB entry age.

The LX can only execute 3 concurrent prefetches. The geode-perf Memcopy Report (http://dev.laptop.org/git?p=geode-perf;a=blob_plain;f=doc/Memcopy%20Report.htm;hb=HEAD) notes: "This reflects the internal limit of the LX of 3 outstanding prefetch transactions".
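A sequential read loop can stay well within this limit by issuing one prefetch per cache line. The sketch below uses GCC's __builtin_prefetch; the function name is hypothetical, and the distance of 64 bytes (two 32-byte lines) is taken from the read test described further down this page. Whether this is the best distance for other access patterns is not something this page measures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32                      /* line size from the text */
#define PREFETCH_AHEAD   (2 * CACHE_LINE_BYTES)  /* 64 bytes ahead */

/* Sum 64-bit words sequentially, issuing one read prefetch per cache line. */
uint64_t sum_with_prefetch(const uint64_t *data, size_t count)
{
    assert(data != NULL || count == 0);

    const size_t words_per_line = CACHE_LINE_BYTES / sizeof(uint64_t);
    const size_t words_ahead = PREFETCH_AHEAD / sizeof(uint64_t);
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        /* Prefetch only at the start of a line and only inside the array. */
        if (i % words_per_line == 0 && i + words_ahead < count)
            __builtin_prefetch(&data[i + words_ahead], 0, 0);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the function returns the same sum with or without it.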

(*) Memory subsystem data was measured with The Calibrator. Download calibrator.c (http://homepages.cwi.nl/~manegold/Calibrator/src/calibrator.c) from the Calibrator page (http://homepages.cwi.nl/~manegold/Calibrator/). Replace all "round" with "round2" in the source file. Compile with "gcc calibrator.c -o calibrator -lm", then run with "./calibrator 433 40M something". The L2 cache is reported as 192 KB (64 KB L1 + 128 KB L2, since it is a victim cache).

The DDR memory in the XO (DDR333, 166 MHz) has an achievable read speed of 600 MB/s under optimal conditions: sequential access, 8-byte reads (MMX), and prefetching 64 bytes ahead (2x the cache line size). This means that on average the Geode LX can read a new cache line every ~20+ cycles. See test5.
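The "~20+ cycles" figure can be checked from the numbers above. The helper below is a hypothetical name, and reading "600MB/sec" as 600 * 10^6 bytes per second is an assumption.

```c
#include <assert.h>

/* CPU cycles that pass per cache line at a given sustained read bandwidth. */
double cycles_per_cache_line(double cpu_hz, double bytes_per_sec,
                             double line_bytes)
{
    assert(cpu_hz > 0.0 && bytes_per_sec > 0.0 && line_bytes > 0.0);

    double lines_per_sec = bytes_per_sec / line_bytes;
    return cpu_hz / lines_per_sec;
}
```

With a 433 MHz clock, 600e6 bytes per second and 32-byte lines, this gives 433e6 / (600e6 / 32), or about 23 cycles per line.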

Extra debug registers

In addition to the normal x86 debug registers, the Geode has a few extra. This allows for a fifth data breakpoint and an opcode breakpoint. The opcode breakpoint is particularly interesting, allowing a debugger to break in on the execution of a particular instruction. The ptrace() system call does not currently support these extra debug registers, and would need a patch. It is likely that one would want to patch gdb as well.

FPU

The FPU is an out-of-order execution unit which also processes MMX and 3DNow! instructions. The information about the FPU in the databook is not only sparse but sometimes misleading and sometimes totally wrong! Out-of-order here means that a single precision FP instruction (FADD, for example) takes 7 cycles to execute (listed as 1 cycle in the databook), an MMX instruction (PADDB, for example) takes 6 cycles to execute (listed as 2 cycles in the databook), and a 3DNow! instruction (PFADD, for example) takes 8 cycles to execute (listed as 2 cycles in the databook). These full instruction lengths (latencies) are not listed in the databook, but were measured in test403b-e and test405c-d. The 1 and 2 cycles listed in the databook are throughputs in the undocumented FPU pipeline, which means that if you run your program on a big enough buffer (so that consecutive elements can be processed in parallel), you can almost achieve the listed cycle counts.

Because the FPU runs asynchronously with the IU, every data exchange requires synchronization, which can consume a LOT of cycles (this is nowhere documented but measured):

  1. If the IU stores to memory and the data is later used by an FP/MMX/3DNow! load, then this FPU instruction must be scheduled 1, 2 or 4 cycles after the IU store (there must be 0, 1 or 3 nops between them, test403a-405d). 3 nops will induce 1 clock penalty for the IU, 5+ nops will idle the FPU.
  2. If the FPU loads from memory, then the instruction which uses the target register (mm0-7 or ST(0-7)) must be scheduled 2 cycles after the FPU load (there must be 1 nop between them, test403a-405d). 0 nops will induce 1 clock penalty for the IU, 3+ nops will idle the FPU.
  3. If the FPU stores to memory and the data is loaded by the IU, then this load instruction must be scheduled 8 cycles after an MMX/3DNow! store (movq) and 14 cycles after an FP store (fstps). For fistpl it is 17 cycles (7, 13 and 16 nops respectively). Delaying the load by fewer cycles will stall the IU, and if the delay is exactly 7, 13 and 16 cycles (6, 12 and 15 nops), the IU will take an additional 1 clock penalty. This was measured in test403a-405d.
  4. If the FPU loads from IU registers (movd, pinsrw) or stores to IU registers (movd, pextrw), then this synchronization stalls the IU and, no matter what number is listed in the databook, the actual cycle count will be much higher! (See the synchronized ops in the MMX section.)
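The delays in rule 3 can be kept in a small lookup table, for example in a code generator or a scheduling script. The enum and function names below are hypothetical; the numbers are the measured values from the list above.

```c
#include <assert.h>

/* FPU store kinds covered by rule 3 above. */
enum fpu_store { STORE_MOVQ, STORE_FSTPS, STORE_FISTPL };

/* Cycles between an FPU store and an IU load of the same data. */
int iu_load_delay_cycles(enum fpu_store kind)
{
    switch (kind) {
    case STORE_MOVQ:   return 8;   /* MMX/3DNow! store */
    case STORE_FSTPS:  return 14;  /* FP store */
    case STORE_FISTPL: return 17;  /* FP integer store */
    }
    assert(!"unknown store kind");
    return -1;
}

/* Single-cycle filler instructions (nops or independent integer work)
   to place between the store and the load. */
int filler_instructions_needed(enum fpu_store kind)
{
    return iu_load_delay_cycles(kind) - 1;
}
```

The filler counts come out as 7, 13 and 16, matching the nop counts quoted in the list. In real code the filler would ideally be useful integer instructions rather than nops.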

Note that these cycle counts and delays do not mean that I know the FPU pipeline or that an instruction really takes that many cycles to execute; they only describe how things look from the perspective of the IU (which is what is needed for assembly optimization).

Note also that the FPU is a single precision unit and calculating in double precision takes a lot more cycles, so if speed is important you should set single precision at the beginning of your program.

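 void set_single_fpu_precision(void)
 {
   unsigned short temp = 0;
   // clear the x87 precision control bits (8-9) to select single precision
   __asm__ __volatile__(
   "fstcw %[temp] \n\t"           // store the control word to memory
   "andw $0xfcff,%[temp] \n\t"    // PC = 00: 24-bit significand
   "fldcw %[temp] \n\t"           // load the modified control word
   : [temp] "+m" (temp)           // temp is both written and read by the asm
   :
   : "memory"
   );
 }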
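The mask 0xfcff clears the precision control field of the x87 control word (bits 8 and 9). The sketch below shows the same bit manipulation in portable C so it can be checked without touching the FPU; the function names are hypothetical and the code only computes values, it does not load the control word.

```c
#include <assert.h>

#define X87_PC_MASK 0x0300u  /* precision control field, bits 8-9 */

/* Significand bits selected by the precision control field of cw. */
int cw_precision_bits(unsigned short cw)
{
    switch ((cw & X87_PC_MASK) >> 8) {
    case 0:  return 24;  /* single precision */
    case 2:  return 53;  /* double precision */
    case 3:  return 64;  /* extended precision */
    default: return -1;  /* value 1 is reserved */
    }
}

/* Control word value with the precision control field set to single. */
unsigned short cw_with_single_precision(unsigned short cw)
{
    unsigned short result = (unsigned short)(cw & ~X87_PC_MASK);  /* cw & 0xfcff */
    assert(cw_precision_bits(result) == 24);
    return result;
}
```

For the x87 power-on default control word 0x037f (extended precision), the result is 0x007f.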

MMX

The MMX operations are special because some of them use 32-bit IU registers directly; I call these synchronized ops (I invented this name for them). They are either Synchronized Loads (SL) or Synchronized Stores (SS).

Synchronized Loads (SL)

These load a 32-bit reg into an MMX reg. It seems that these ops can be scheduled only if the following two conditions are met; otherwise they stall the IU (test 301a-j):

  1. There are no dependent ops running (that is, no running FPU instruction has the SL's target register as its source or target operand).
  2. There are no other SL ops running (regardless of which registers they use).

MOVD takes 5 cycles to schedule and 0 to execute (so the IU is busy for 5 cycles and does not seem to block any pending FPU instructions).

PINSRW takes 5 cycles to schedule and 4 to execute (so the IU is busy for 5 cycles, after which the op steals 4 cycles from executing FPU instructions). The IU can schedule non-dependent normal MMX ops after these ops.

Synchronized Stores (SS)

These store part of an MMX reg into a 32-bit reg. These ops stall the IU until their source values have been calculated in the MMX queue; once the source is available, they take 10 (ten!) cycles to execute. MOVD and PEXTRW have been tested; PMOVMSKB has not. It seems that the MMX queue keeps running while such an op is executing.

TBD: test PMOVMSKB? or is it useless enough?

In practice it means that:

  1. You really do not want to use the result of an MMX op in integer ops. In other words, if you want to use MMX because there are too few integer regs, or because there is a fancy MMX op which does something useful, forget it!
  2. You really want to use MMX only on streaming data (a large enough buffer). If you really need the results in 32-bit regs, process a large enough buffer first and then, in a second pass, load the results from memory into 32-bit regs and use them there.
  3. Bear in mind that certain algorithms can be faster when implemented with integer instructions instead of MMX.
  4. You have to interleave MMX and integer instructions to fully utilize the Geode. You can do this by scheduling up to 6 MMX instructions before the integer ones, or by simply interleaving them.
  5. You have to create at least 3 independent (orthogonal) dependency chains.
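Point 5 can be illustrated in plain C with a reduction split over three accumulators, so that consecutive additions do not wait for each other's results. One possible reading of the "3" is the measured MMX latency of 6 cycles divided by the listed throughput of 2 cycles, but that is an interpretation, not something the tests state. The function name is hypothetical, and whether a compiler keeps the three chains separate in the generated code depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a float array using three independent accumulators (a, b, c),
   giving the FPU three dependency chains to overlap. */
float sum_three_chains(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    float a = 0.0f, b = 0.0f, c = 0.0f;
    size_t i = 0;

    for (; i + 3 <= n; i += 3) {
        a += x[i];        /* chain 1 */
        b += x[i + 1];    /* chain 2 */
        c += x[i + 2];    /* chain 3 */
    }
    for (; i < n; i++)    /* leftover elements */
        a += x[i];

    return (a + b) + c;
}
```

Splitting a floating point sum this way changes the order of the additions, so the rounding of the result can differ slightly from a single-accumulator loop.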

Test results

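The numbers shown here are calculated from test program runs:

  • Test1 (geodetest_cputest1.zip), Results (geodetest_results1.zip)
  • Test2 (geodetest_cputest2.zip), Results (geodetest_results2.zip)
  • Test3 (geodetest_cputest3.zip), Results (geodetest_results3.zip)
  • Test4 with results in the comments (geodetest_cputest4.zip)
  • Test5 with results in the comments (geodetest_cputest5.zip)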

Notes

  • TBD: To Be Done
  • TBDWGRM: To Be Done When Get a Real Machine

References

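  • AMD Geode™ LX Processors Data Book revG. AMD, 2008-05. http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33234G_LX_databook.pdf (broken link? possible newer version: http://support.amd.com/us/Embedded_TechDocs/33234H_LX_databook.pdf)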

See also

  • Geode instruction set
  • Geode optimization effort

External links