docs:
SMS (swing modulo scheduling) schedules the instructions of loops, unlike the traditional scheduler in GCC, which gives loops no special handling. For more information on the theory behind SMS, see the 2004 GCC summit proceedings (page 55). This optimization helps in loops where consecutive iterations could run concurrently but the traditional instruction scheduling cannot fully utilize the hardware functional units. It is disabled by default because of its compile-time cost; -fmodulo-sched activates it.
Source: https://gcc.gnu.org/news/sms.html
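As an illustration only (not from the GCC docs), a loop of the following shape, whose iterations are independent of each other, is the kind of candidate SMS can overlap; the function name and compile line are hypothetical:

    /* Hypothetical compile line: gcc -O2 -fmodulo-sched -c saxpy.c */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];  /* no dependence between iterations, so
                                        consecutive iterations may be overlapped */
    }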
docs:
Enable register pressure sensitive insn scheduling before register allocation. This only makes sense when scheduling before register allocation is enabled, i.e. with -fschedule-insns or at -O2 or higher. Usage of this option can improve the generated code and decrease its size by preventing register pressure increase above the number of available hard registers and subsequent spills in register allocation.
This is enabled with -O3, but only for some targets; it seems to be off for x86_64.
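A minimal sketch of the situation the option addresses (the function and compile lines below are illustrative assumptions, not from the docs): when many values are live at once, aggressive pre-RA scheduling can push register pressure past the number of hard registers and force spills.

    /* Hypothetical compile lines for comparison:
           gcc -O2 -fschedule-insns                  -S pressure.c
           gcc -O2 -fschedule-insns -fsched-pressure -S pressure.c */
    float dot8(const float *a, const float *b)
    {
        /* eight partial products are live at the same time, so the
           pre-RA scheduler's choices directly affect register pressure */
        float s0 = a[0]*b[0], s1 = a[1]*b[1], s2 = a[2]*b[2], s3 = a[3]*b[3];
        float s4 = a[4]*b[4], s5 = a[5]*b[5], s6 = a[6]*b[6], s7 = a[7]*b[7];
        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }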
docs:
Use IRA to evaluate register pressure in loops for decisions to move loop invariants. This option usually results in generation of faster and smaller code on machines with large register files (>= 32 registers), but it can slow the compiler down.
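For illustration (names and compile line are hypothetical, not from the docs), the subexpressions built from k[] below are loop invariants; whether hoisting all of them out of the loop is profitable depends on the register pressure inside the loop, which this option asks IRA to estimate:

    /* Hypothetical compile line: gcc -O2 -fira-loop-pressure -c invariants.c */
    void scale(float *restrict out, const float *restrict in,
               const float *restrict k, int n)
    {
        for (int i = 0; i < n; i++)
            /* (k[0] + k[1]) and (k[2] - k[3]) are invariant across iterations */
            out[i] = in[i] * (k[0] + k[1]) + (k[2] - k[3]);
    }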
docs:
I'd recommend using this at least for x86/x86-64. I think any OOO processor with a small or moderate register file which does not use the 1st insn scheduling might benefit from this too.
On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with general tuning), using the optimization results in smaller code size on average (for floating point and integer benchmarks in 32- and 64-bit mode). The improvement is more visible for SPECFP2000 (although I see the same improvement on x86-64 SPECInt2000, it might be attributed mostly to mcf benchmark instability). It is about 0.5% in 32-bit and 64-bit mode. That is understandable, as the optimization has more opportunities to improve the code on longer BBs. Unlike other heuristic optimizations, I don't see any significantly worse performance: it gives practically the same or better performance (a few benchmarks improved by 1% or more, up to 3%).
The single but significant drawback is additional compilation time (4%-6%), as the 1st insn scheduling pass is quite expensive.
Source of docs: https://gcc.gnu.org/legacy-ml/gcc-patches/2013-11/msg00420.html
docs:
When -fgcse-las is enabled, the global common subexpression elimination pass eliminates redundant loads that come after stores to the same memory location (both partial and full redundancies).
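A sketch of the pattern the docs describe (the function and compile line are illustrative assumptions; a case this small may already be caught by earlier passes, it only shows the shape of the redundancy):

    /* Hypothetical compile line: gcc -O2 -fgcse-las -c las.c */
    int store_then_load(int *p, int v)
    {
        *p = v;         /* store to *p */
        return *p + 1;  /* load from the same location right after the store;
                           the stored value can be reused instead of reloading */
    }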
docs:
When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination. This pass attempts to move stores out of loops. When used in conjunction with -fgcse-lm, loops containing a load/store sequence can be changed to a load before the loop and a store after the loop.
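As an illustrative sketch (names and compile line are assumptions, not from the docs), the loop below reloads and stores *sum on every iteration; with -fgcse-lm and -fgcse-sm the pattern can become one load before the loop, a register accumulation inside it, and one store after it:

    /* Hypothetical compile line: gcc -O2 -fgcse-lm -fgcse-sm -c sm.c */
    void accumulate(int *restrict sum, const int *restrict a, int n)
    {
        for (int i = 0; i < n; i++)
            *sum += a[i];   /* load of *sum, add, store back to *sum */
    }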