docs:
I'd recommend using this at
least for x86/x86-64. I think any out-of-order (OOO) processor with a
small or moderate register file that does not use the 1st insn
scheduling pass might benefit from it too.
On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with
general tuning), using the optimization results in smaller code size
on average (for floating point and integer benchmarks in 32- and
64-bit mode). The improvement is more visible for SPECFP2000 (although
I see the same improvement on x86-64 SPECInt2000, it might be
attributed mostly to mcf benchmark instability). It is about 0.5% for
32-bit and 64-bit mode. This is understandable, as the optimization has
more opportunities to improve the code on longer basic blocks (BBs).
Unlike other heuristic optimizations, I don't see any significantly
worse performance. It gives practically the same or better performance
(a few benchmarks improved by 1% or more, up to 3%).
The one significant drawback is additional compilation time
(4%-6%), as the 1st insn scheduling pass is quite expensive.
Source of docs: https://gcc.gnu.org/legacy-ml/gcc-patches/2013-11/msg00420.html
docs:
When -fgcse-las is enabled, the global common subexpression elimination pass eliminates redundant loads that come after stores to the same memory location (both partial and full redundancies).
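To make the load-after-store case concrete, here is a minimal sketch in C. The function names are hypothetical and the second function is a hand-written illustration of the effect, not actual GCC output. Whether GCC performs this in the GCSE pass, or an earlier pass already handles it, depends on the compiler version, target, and the other options in use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a load that follows a store to the same location. */
int store_then_load(int *p, int v)
{
    assert(p != NULL);
    *p = v;          /* store to *p */
    return *p + 1;   /* load from *p right after the store: fully redundant */
}

/* Hand-written illustration of the result once the redundant load is
   eliminated: the stored value is reused instead of being re-read. */
int store_then_load_forwarded(int *p, int v)
{
    assert(p != NULL);
    *p = v;          /* the store itself is kept */
    return v + 1;    /* no load: the value is already known to be v */
}
```

Both functions return the same value and leave the same value in memory; only the memory read differs.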
docs:
When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination. This pass attempts to move stores out of loops. When used in conjunction with -fgcse-lm, loops containing a load/store sequence can be changed to a load before the loop and a store after the loop.
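The combined effect of load motion and store motion on a loop can be sketched in C as follows. The function names are hypothetical, and the second function is a hand-written illustration of the described result, not actual GCC output. The sketch assumes that total does not point into the values array; a compiler can only do this when it can prove the two do not alias.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: every iteration loads *total, adds to it,
   and stores it back (a load/store sequence inside the loop). */
void accumulate(int *total, const int *values, size_t n)
{
    assert(total != NULL && values != NULL);
    for (size_t i = 0; i < n; i++)
        *total += values[i];
}

/* Hand-written illustration of the described result: one load before
   the loop and one store after it. Assumes total does not alias values. */
void accumulate_hoisted(int *total, const int *values, size_t n)
{
    assert(total != NULL && values != NULL);
    int t = *total;              /* load moved before the loop */
    for (size_t i = 0; i < n; i++)
        t += values[i];          /* loop body works on a local (register) */
    *total = t;                  /* store moved after the loop */
}
```

For non-aliasing arguments the two functions produce the same final value in *total, while the second touches that memory location only twice regardless of n.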