User:Bluefoxicy/gcc optimizations: Difference between revisions
Revision as of 14:47, 4 October 2006
Examining the chart of optimization results (ODS), you might notice immediately that some optimizations make things faster while others make things slower. Eventually I found that -fno-tree-pre made certain code faster without sacrificing much out of other code; #gcc on OFTC says this is because the redundancy analyzer on x86 is "stupid" and so Partial Redundancy Elimination does not work right in some cases on x86.
I actually noticed this from the opposite direction; it wasn't that -fno-tree-pre made things faster, but that -ftree-pre made things slower. -ftree-pre is enabled by default at -O2 but not at -Os; I had a hunch -Os would be faster due to cache limitations, and noticed immediately that nbench gave a much higher "Integer Index" for it.
After some digging through the optimizations that -Os turns off, I found that -ftree-pre was among those enabled by -O2. I switched it on in just a couple of tests to see what happened, and later noticed that ALL of those tests suffered significantly on "Integer Index," the same index that is slower with -O2. I switched gears and turned on -fno-tree-pre while building with -O2; the results were extremely pleasing.
Important: Deeper investigation shows that on a normal x86, disabling -ftree-pre causes big slowdowns. I am testing specifically on the Geode GX using povbench (this takes over a day on the dev board). It is possible that nbench is a poor benchmark and produces misleading results; it is also possible that the Geode's behavior causes a real difference in performance. Relying on nbench alone in this case would be improper.
See also [Optimization on an Ubuntu desktop machine] for more results, some of them wildly conflicting with those here.
Benchmarking
Nbench
Initial Results
-march=i586 -O2
Memory Index: 0.728
Integer Index: 0.693
Floating Point Index: 0.837
Notice that the Integer Index is somewhat low.
-Os Results
-march=i586 -Os
Memory Index: 0.719
Integer Index: 0.736
Floating Point Index: 0.804
Good gains on Integer Index, but we lose on everything else.
Final Results
-march=i586 -O2 -fno-tree-pre
Memory Index: 0.728
Integer Index: 0.755
Floating Point Index: 0.830
The Memory Index is the same; but we get a 0.062 increase in Integer Index, giving a performance increase of (0.062/0.693) or 8.95%.
The Floating Point Index drops by 0.007, giving a performance decrease of (0.007/0.837) or 0.836%.
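Both percentages are relative changes measured against the -O2 baseline. A minimal sketch of that arithmetic, assuming awk is available; the pct_change helper name is made up for illustration and is not part of nbench:

```shell
# pct_change BASE NEW: relative change of NEW against BASE, in percent.
# Hypothetical helper for this page only.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.3f\n", (new - base) / base * 100 }'
}

pct_change 0.693 0.755   # Integer Index: -O2 vs -O2 -fno-tree-pre
pct_change 0.837 0.830   # Floating Point Index: same pair of builds
```

A positive result is a gain over the baseline and a negative result is a loss.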
POVbench
I tried out POVRAY rendering povbench to test these on a realistic workload. The actual INI and POV files from the POVRAY site are used.
These tests take about a day to run each. The CPU gets hot enough to almost burn your hand if you touch it for an extended period; but just barely, and only on parts where your skin is very thin (like fingertips).
On the Athlon 64 with -O2 and -march=pentiumpro, -fno-tree-pre slows things down. Testing on the Geode itself shows different results: while the Athlon 64 lost about 17%, POVRAY on the Geode picked up about 30%. This shows that the Geode does not respond to the code changes the same way, performance-wise.
-O2 and PentiumPro
This test is the baseline. It uses -O2 optimization and the PentiumPro architecture, with 3DNow and MMX enabled explicitly. The PentiumPro itself has no MMX, so -march=pentiumpro does not turn it on; -mmmx does. The build was as follows:
$ MYCFLAGS="-march=pentiumpro -m3dnow -mmmx -O2"
$ make CFLAGS="$MYCFLAGS" CXXFLAGS="$MYCXXFLAGS"
Test results:
Total Scene Processing Times
Parse Time: 0 hours 0 minutes 55 seconds (55 seconds)
Photon Time: 0 hours 29 minutes 59 seconds (1799 seconds)
Render Time: 28 hours 56 minutes 5 seconds (104165 seconds)
Total Time: 29 hours 26 minutes 59 seconds (106019 seconds)
real 106019.49
user 105977.32
sys 12.90
-O2 and PentiumPro with -fno-tree-pre
Using -fno-tree-pre on POVRAY produced the results below. Measured against the baseline run, Parse Time is 29.1% shorter (55 to 39 seconds), Photon Time is 35.6% shorter (1799 to 1159 seconds), Render Time is 34.8% shorter, and Total Time is 34.8% shorter. User time also drops by 34.8%.
$ MYCFLAGS="-march=pentiumpro -m3dnow -mmmx -O2 -fno-tree-pre"
$ make CFLAGS="$MYCFLAGS" CXXFLAGS="$MYCXXFLAGS"
Total Scene Processing Times
Parse Time: 0 hours 0 minutes 39 seconds (39 seconds)
Photon Time: 0 hours 19 minutes 19 seconds (1159 seconds)
Render Time: 18 hours 52 minutes 30 seconds (67950 seconds)
Total Time: 19 hours 12 minutes 28 seconds (69148 seconds)
real 69149.38
user 69095.17
sys 2.98
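"Faster" can mean either less elapsed time or a higher speedup factor, and the two give different numbers. As a rough sketch (assuming awk is available; the compare helper is a made-up name), both can be recomputed from the seconds reported for the baseline run and this run:

```shell
# compare BASE_SECONDS NEW_SECONDS: print the share of time saved and the
# speedup factor. Hypothetical helper for this page only.
compare() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        printf "%.1f%% less time, %.2fx speedup\n",
               (base - new) / base * 100, base / new
    }'
}

compare 55 39           # Parse Time
compare 1799 1159       # Photon Time
compare 104165 67950    # Render Time
compare 106019 69148    # Total Time
```

For Total Time this prints "34.8% less time, 1.53x speedup": the run takes about a third less time, which is the same thing as running about one and a half times as fast.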
-O2 and PentiumPro with -ffast-math
The FPU is crippled and I am enjoying torturing the dev board, so I have set up the -ffast-math test below, to be followed by a -ffast-math -fno-tree-pre test. POVRAY is very FPU intensive, so the -ffast-math results will be highly exaggerated.
$ MYCFLAGS="-march=pentiumpro -m3dnow -mmmx -O2 -ffast-math"
$ make CFLAGS="$MYCFLAGS" CXXFLAGS="$MYCXXFLAGS"
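The four builds differ only in the flags appended to the baseline set, so they can be generated from one place. The sketch below is an illustration, not the procedure actually used here: the variant names are made up, and passing the same flags through MYCXXFLAGS is an assumption, since the commands on this page never show how that variable was set.

```shell
# variant_flags NAME: print the compiler flags for one benchmark variant.
# Variant names are hypothetical; the flags are the ones used on this page.
variant_flags() {
    base="-march=pentiumpro -m3dnow -mmmx -O2"
    case "$1" in
        baseline)         echo "$base" ;;
        no-pre)           echo "$base -fno-tree-pre" ;;
        fast-math)        echo "$base -ffast-math" ;;
        fast-math-no-pre) echo "$base -ffast-math -fno-tree-pre" ;;
        *)                echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Assumption: any C++ sources should see the same flags as the C sources.
MYCFLAGS=$(variant_flags fast-math-no-pre)
MYCXXFLAGS=$MYCFLAGS
# make CFLAGS="$MYCFLAGS" CXXFLAGS="$MYCXXFLAGS"
```

If MYCXXFLAGS is left unset, any C++ code in the build is compiled without these flags, so it is worth checking before comparing runs.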
Other Thoughts
Note that this isn't raw FPU or Integer operation performance, but rather the performance of code using such operations, which could be affected by how loops are scheduled, the way FPU insns are scheduled, and so on. I have no real clue HOW nbench comes up with these numbers, so I am taking them as a measure of general code flow. It appears, then, that some code goes up in performance by 8.95%, while other code goes down by 0.836%.
I am unable to measure the exact implications. Somebody should decode a JPEG image, render a complex Web page, and play an Ogg Vorbis file, with -O2 and then with -O2 -fno-tree-pre used to compile ALL code involved along the way. The benchmark for these operations is simply how much real time they take, which is measurable with the 'time' utility (as long as you run each operation in series and exit immediately when it finishes). Hand-written assembly code does not count, because these flags do not optimize it, so disable it when taking any such measurements.
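A wall-clock comparison of that kind could be scripted roughly as follows. This is only a sketch: wall_seconds is a made-up helper, date +%s is assumed to be available (GNU and BSD date both provide it), and the binary names in the comments are placeholders for programs built with each flag set.

```shell
# wall_seconds CMD...: run CMD once, discard its output, and print the
# elapsed wall-clock time in whole seconds. Hypothetical helper.
wall_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Placeholder usage, one run per build, executed in series:
# wall_seconds ./decoder-O2 test.jpg
# wall_seconds ./decoder-O2-no-pre test.jpg
```

One-second resolution is coarse, so this only makes sense for workloads that run for minutes or longer, as the benchmarks on this page do.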
The POVRAY tests show that POVRAY is a lot faster on the Geode with -fno-tree-pre. Other applications may vary, but I have no reason to believe that PRE is a major win on a normal desktop CPU and a major loss on the Geode for POVRAY only.
Tuning
Vladimir Makarov has created a gcc patch for the Geode and posted it to the mailing list (http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00452.html). His patch enables MMX, 3DNow, 3DNow-A, and SSE prefetch insns. AMD's data book doesn't give specific tuning information; however, code built with his patch performs just as well as PentiumPro code and is 5-7% smaller.
The Geode GX handles the SSE prefetch instructions PREFETCHNTA, PREFETCHT0, PREFETCHT1, PREFETCHT2; but gcc doesn't know this without Makarov's patch. Geode GX doesn't handle any other SSE insns.
I have not examined gcc to see if it makes any decisions based on the size of cache or the number of TLB entries. If it does, it should know there's 16KiB I1 and 16KiB D1 4-way set associative 32-byte cacheline L1, no L2, 8 ITLB and 8 DTLB L1 TLB entries, and 64 L2 TLB entries.
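GCC releases newer than the one tested here accept -march=geode and --param values describing the L1 cache, which feed its prefetching heuristics. As a sketch only, assuming such a newer GCC, the cache figures above would be passed like this (GEODE_CFLAGS is a made-up variable name, and none of this has been benchmarked here):

```shell
# Sketch only: assumes a GCC that accepts -march=geode and these --param names.
# l1-cache-size is in KiB (16 KiB D1) and l1-cache-line-size is in bytes
# (32-byte lines), taken from the figures above.
GEODE_CFLAGS="-march=geode -O2 --param l1-cache-size=16 --param l1-cache-line-size=32"
# make CFLAGS="$GEODE_CFLAGS" CXXFLAGS="$GEODE_CFLAGS"
```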
The Geode processor core itself is derived from the Cyrix MediaGX processor, and so it may be possible to tune the code for that processor and have better performance than simple PentiumPro tuned code. Makarov has tuned the Geode GX target somewhat, but has stated that there is not enough documentation to do detailed tuning.