User:Bluefoxicy/gcc optimizations: Difference between revisions
Revision as of 23:17, 30 September 2006
Examining the chart of optimization results (ODS), you might notice immediately that some optimizations make things faster while others make things slower. Eventually I found that -fno-tree-pre made certain code faster without costing other code much; according to #gcc on OFTC, this is because the redundancy analyzer on x86 is "stupid", so Partial Redundancy Elimination does not work well there.
I actually noticed this the other way around: it wasn't that -fno-tree-pre made things faster, but that -ftree-pre made things slower. -ftree-pre is enabled by default at -O2 but not at -Os; I had a hunch -Os would be faster due to cache limitations, and noticed immediately that nbench gave a much higher "Integer Index" with -Os.
After some digging through the optimizations that -Os turns off, I found that -ftree-pre is enabled by -O2. I switched it on in only a couple of tests to see what happened, and later noticed that ALL of those tests suffered significantly on "Integer Index", the same index that is slower under -O2. I switched gears and added -fno-tree-pre while building with -O2; the results were extremely pleasing.
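The digging described above can be partly automated. The following is a minimal sketch, assuming a GCC new enough to support `-Q --help=optimizers` (this option did not exist in the GCC releases current when this page was written) and assuming its report lists one flag per line followed by a bracketed state such as `[enabled]`. The function names are hypothetical.

```python
import subprocess


def parse_optimizers(report: str) -> dict:
    """Map each -f flag to its state from a `gcc -Q --help=optimizers` report."""
    flags = {}
    for line in report.splitlines():
        parts = line.split()
        # Assumed line shape: "  -ftree-pre        [enabled]"
        if len(parts) == 2 and parts[0].startswith("-f") and parts[1].startswith("["):
            flags[parts[0]] = parts[1].strip("[]")
    return flags


def diff_levels(report_a: str, report_b: str) -> dict:
    """Return the flags whose state differs between two reports."""
    a, b = parse_optimizers(report_a), parse_optimizers(report_b)
    return {f: (a[f], b[f]) for f in sorted(a.keys() & b.keys()) if a[f] != b[f]}


def query_gcc(level: str) -> str:
    """Ask the local gcc which optimizations a level enables.

    Assumes `gcc` is on PATH and supports -Q --help=optimizers.
    """
    result = subprocess.run(
        ["gcc", "-Q", "--help=optimizers", level],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `diff_levels(query_gcc("-O2"), query_gcc("-Os"))` would then list the flags that differ between the two levels on whatever compiler is installed; the result depends on the GCC version and may not match the behaviour described on this page.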
Initial Results
-march=i586 -O2
Memory Index: 0.728
Integer Index: 0.693
Floating Point Index: 0.837
Notice that the Integer Index is somewhat low.
-Os Results
-march=i586 -Os
Memory Index: 0.719
Integer Index: 0.736
Floating Point Index: 0.804
Good gains on Integer Index, but we lose on everything else.
Final Results
-march=i586 -O2 -fno-tree-pre
Memory Index: 0.728
Integer Index: 0.755
Floating Point Index: 0.830
The Memory Index is unchanged from plain -O2, but the Integer Index rises by 0.062, a performance increase of (0.062/0.693) or 8.95%.
The Floating Point Index drops by 0.007, a performance decrease of (0.007/0.837) or 0.836%.
Note that these indexes do not measure raw FPU or integer instruction performance, but rather workloads built on such operations. I have no real clue HOW nbench comes up with these numbers, so I am treating them as a measure of general code flow. It appears, then, that some code speeds up by 8.95%, while other code slows down by 0.836%.
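The percentages above can be re-derived from the three result sets. This is a small sketch using only the index values reported on this page, with plain -O2 as the baseline; the dictionary layout and the `percent_change` helper are illustrative choices.

```python
# nbench indexes as reported above (all builds use -march=i586).
results = {
    "-O2": {"memory": 0.728, "integer": 0.693, "float": 0.837},
    "-Os": {"memory": 0.719, "integer": 0.736, "float": 0.804},
    "-O2 -fno-tree-pre": {"memory": 0.728, "integer": 0.755, "float": 0.830},
}


def percent_change(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = results["-O2"]
for name, indexes in results.items():
    if name == "-O2":
        continue  # the baseline is not compared against itself
    changes = ", ".join(
        f"{key} {percent_change(baseline[key], val):+.2f}%"
        for key, val in indexes.items()
    )
    print(f"{name}: {changes}")
```

Printing all three indexes side by side also makes the -Os trade-off visible: it gains on Integer Index but gives some of that back on the other two.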
I am unable to measure the exact implications. Somebody should decode a JPEG image, render a complex Web page, and play an Ogg Vorbis file, with ALL code involved along the way compiled once with -O2 and once with -O2 -fno-tree-pre. The benchmark for these operations is precisely how much real time they take, which is measurable with the 'time' utility (as long as you complete each operation in series and exit immediately when finished). Hand-written assembly code does not count because these options do not optimize it, so disable it when taking any such measurements.
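The real-time comparison proposed above could be scripted roughly as follows. This is a sketch, not a finished benchmark: it assumes two builds of the same program already exist (one compiled with -O2, one with -O2 -fno-tree-pre), that each command exits as soon as its work is done, and that taking the best of a few serial runs is an acceptable way to reduce noise. The function names are hypothetical.

```python
import subprocess
import time


def time_command(argv, runs=3):
    """Run argv `runs` times in series; return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded so terminal speed does not affect the timing.
        subprocess.run(
            argv, check=True,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


def compare(argv_a, argv_b, runs=3):
    """Return (best_a, best_b, percent by which b is faster than a)."""
    best_a = min(time_command(argv_a, runs))
    best_b = min(time_command(argv_b, runs))
    return best_a, best_b, (best_a - best_b) / best_a * 100.0
```

A call such as `compare(["./decoder-O2", "test.jpg"], ["./decoder-no-tree-pre", "test.jpg"])` would time both builds on the same input; those binary and file names are placeholders, not real tools.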