docs:
Assume that a loop with an exit will eventually take the exit and not loop indefinitely. This allows the compiler to remove loops that otherwise have no side-effects, not considering eventual endless looping as such.
This option is enabled by default at -O2 for C++ with -std=c++11 or higher.
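As a minimal sketch of the kind of loop this option affects (the struct and function names are hypothetical and not taken from the GCC docs): the loop below has an exit condition and no side effects, so with -ffinite-loops the compiler may assume it terminates and remove it entirely, even if a caller passes a circular list.

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
    struct node *next;
};

/* Walks the list to its end but produces no observable result.
 * The loop has an exit (n == NULL) and no side effects, so under
 * -ffinite-loops GCC is allowed to assume it terminates and may
 * remove it, whether or not the list is actually circular. */
void walk_to_end(const struct node *n)
{
    while (n != NULL)
        n = n->next;
}
```

If the loop is removed, a call with a circular list returns immediately instead of hanging, which is the behavior change to keep in mind.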
The rationale for the default 'dynamic' is to do some multiprocessing in a first local optimisation step. This can lead to failures at link time (rare!) and is also not optimal.
More details:
https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
docs:
Constructs webs as commonly used for register allocation purposes and assign each web individual pseudo register. This allows the register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover. It can, however, make debugging impossible, since variables no longer stay in a “home register”.
The docs state that it is enabled by default with -funroll-loops, which we don't use.
docs:
Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization most benefits processors with lots of registers. Depending on the debug information format adopted by the target, however, it can make debugging impossible, since variables no longer stay in a “home register”.
Enabled by default with -funroll-loops.
GCC's behavior again does not follow the documentation, which suggests that 'dynamic' is the default. GCC 12 actually defaults to 'unlimited', which does not estimate the cost against the benefit at all.
Docs:
Alter the cost model used for vectorization of loops marked with the OpenMP simd directive. The model argument should be one of ‘unlimited’, ‘dynamic’, ‘cheap’. All values of model have the same meaning as described in -fvect-cost-model and by default a cost model defined with -fvect-cost-model is used.
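A loop that this cost model applies to could look like the following sketch. The function name and signature are assumptions made for illustration; only the `#pragma omp simd` directive comes from the docs above.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i].
 * The pragma only takes effect when the file is compiled with
 * -fopenmp or -fopenmp-simd; otherwise it is ignored and the loop
 * is treated like any other loop. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With 'unlimited' the loop is vectorized without weighing cost against benefit, while 'dynamic' and 'cheap' let the compiler decline when vectorization does not look profitable.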
The behavior of GCC 12 does not follow the docs here. Regardless of -O2/-O3 or -march=native/generic/x86-64-v3, it is always set to 'one'. The docs state that it should be set to 'mixed' when optimizations are on.
Docs:
Use specified regions for the integrated register allocator. The region argument should be one of the following:
‘all’
Use all loops as register allocation regions. This can give the best results for machines with a small and/or irregular register set.
‘mixed’
Use all loops except for loops with small register pressure as the regions. This value usually gives the best results in most cases and for most architectures, and is enabled by default when compiling with optimization for speed (-O, -O2, …).
‘one’
Use all functions as a single region. This typically results in the smallest code size, and is enabled by default for -Os or -O0.
docs:
Detect paths that trigger erroneous or undefined behavior due to a null value being used in a way forbidden by a returns_nonnull or nonnull attribute. Isolate those paths from the main control flow and turn the statement with erroneous or undefined behavior into a trap. This is not currently enabled, but may be enabled by -O2 in the future.
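A sketch of the pattern this option targets, using hypothetical function names: one path passes a null pointer to a parameter marked nonnull. With the option enabled, GCC may isolate that path and replace the offending statement with a trap, so it does not continue with undefined behavior.

```c
#include <stddef.h>

/* The nonnull attribute promises that s is never NULL. */
__attribute__((nonnull))
static int first_char(const char *s)
{
    return (unsigned char)s[0];
}

/* When have_input is 0, a NULL pointer reaches the nonnull parameter,
 * which is undefined behavior. This is the path the option may
 * isolate and turn into a trap. The have_input != 0 path is
 * unaffected. */
int first_char_or_trap(const char *s, int have_input)
{
    const char *p = have_input ? s : NULL;
    return first_char(p);
}
```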
docs:
Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive memory and compile-time usage on large compilation units. It is not enabled by default at any optimization level.
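To illustrate what interprocedural pointer analysis can in principle establish, here is a small sketch with hypothetical names. Whether GCC actually performs this particular optimization with the option enabled is an assumption and has not been verified here.

```c
/* noinline keeps the call boundary, so any aliasing knowledge has to
 * come from analysis across functions, not from inlining. */
__attribute__((noinline))
static void add_twice(int *counter, const int *step)
{
    *counter += *step;
    /* Looking at this function alone, *step must be reloaded here,
     * because the store to *counter might have changed it. */
    *counter += *step;
}

int run(void)
{
    int counter = 0;
    int step = 2;
    /* The only caller passes two distinct locals, so an
     * interprocedural analysis can see that the pointers never alias. */
    add_twice(&counter, &step);
    return counter;
}
```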
docs:
Generate code to automatically split the stack before it overflows. The resulting program has a discontiguous stack which can only overflow if the program is unable to allocate any more memory. This is most useful when running threaded programs, as it is no longer necessary to calculate a good stack size to use for each thread. This is currently only implemented for the x86 targets running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled without -fsplit-stack, there may not be much stack space available for the latter code to run. If compiling all code, including library code, with -fsplit-stack is not an option, then the linker can fix up these calls so that the code compiled without -fsplit-stack always has a large stack. Support for this is implemented in the gold linker in GNU binutils release 2.21 and later.
docs:
Enable the identity transformation for graphite. For every SCoP we generate the polyhedral representation and transform it back to gimple. Using -fgraphite-identity we can check the costs or benefits of the GIMPLE -> GRAPHITE -> GIMPLE transformation. Some minimal optimizations are also performed by the code generator isl, like index splitting and dead code elimination in loops.
docs:
Consider that instructions that may throw exceptions but don’t otherwise contribute to the execution of the program can be optimized away. This does not affect calls to functions except those with the pure or const attributes. This option is enabled by default for the Ada and C++ compilers, as permitted by the language specifications. Optimization passes that cause dead exceptions to be removed are enabled independently at different optimization levels.
docs:
SMS is intended to schedule instructions of loops rather than the traditional scheduler (in GCC) that does not give a special handling for loops. For more information on the theory behind SMS take a look at the 2004 GCC summit proceedings (page 55). This optimization helps in loops where there is a place to run consecutive iterations concurrently but the traditional instruction scheduling is not able to fully utilize the hardware functional units. This optimization is disabled by default because of compile time consumption; -fmodulo-sched activates it.
Source: https://gcc.gnu.org/news/sms.html
docs:
Enable register pressure sensitive insn scheduling before register allocation. This only makes sense when scheduling before register allocation is enabled, i.e. with -fschedule-insns or at -O2 or higher. Usage of this option can improve the generated code and decrease its size by preventing register pressure increase above the number of available hard registers and subsequent spills in register allocation.
This is enabled with -O3, but only for some targets. Seems to be off for x86_64.
docs:
Use IRA to evaluate register pressure in loops for decisions to move loop invariants. This option usually results in generation of faster and smaller code on machines with large register files (>= 32 registers), but it can slow the compiler down.
docs:
I'd recommend to use this at
least for x86/x86-64. I think any OOO processor with small or
moderate register file which does not use the 1st insn scheduling
might benefit from this too.
On SPEC2000 for x86/x86-64 (I use Haswell processor, -O3 with
general tuning), the optimization usage results in smaller code size
in average (for floating point and integer benchmarks in 32- and
64-bit mode). The improvement is more visible for SPECFP2000 (although
I have the same improvement on x86-64 SPECInt2000 but it might be
attributed mostly to mcf benchmark instability). It is about 0.5% for
32-bit and 64-bit mode. It is understandable, as the optimization has
more opportunities to improve the code on longer BBs. Different from
other heuristic optimizations, I don't see any significant worse
performance. It gives practically the same or better performance (a
few benchmarks improved by 1% or more, up to 3%).
The single but significant drawback is additional compilation time
(4%-6%) as the 1st insn scheduling pass is quite expensive.
Source of docs: https://gcc.gnu.org/legacy-ml/gcc-patches/2013-11/msg00420.html
docs:
When -fgcse-las is enabled, the global common subexpression elimination pass eliminates redundant loads that come after stores to the same memory location (both partial and full redundancies).
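At source level, a load after a store to the same location looks like the sketch below (the function name is hypothetical). The pass itself works on GCC's internal representation, and other passes may already remove a case this simple, so this only illustrates the pattern.

```c
int store_then_load(int *p, int v)
{
    *p = v;         /* store to *p */
    return *p + 1;  /* load from the same location: redundant, the value is v */
}
```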
docs:
When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination. This pass attempts to move stores out of loops. When used in conjunction with -fgcse-lm, loops containing a load/store sequence can be changed to a load before the loop and a store after the loop.
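The load/store sequence described above can be sketched at source level as follows. Both functions and their names are illustrative assumptions. The second one is a hand-written version of what the loop looks like after the load is placed before the loop and the store after it.

```c
#include <stddef.h>

/* As written: *total is loaded and stored on every iteration. */
void sum_into(int *restrict total, const int *restrict values, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *total += values[i];
}

/* Hand-written equivalent of the transformed loop. */
void sum_into_hoisted(int *restrict total, const int *restrict values, size_t n)
{
    int acc = *total;               /* single load before the loop */
    for (size_t i = 0; i < n; i++)
        acc += values[i];
    *total = acc;                   /* single store after the loop */
}
```

The restrict qualifiers state that total does not alias values. Without that guarantee the store cannot safely be moved out of the loop.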