Why can't compilers optimize code perfectly?

What every programmer should know about compiler optimizations



Hadi Brais

Download the code samples

High-level programming languages provide numerous abstract programming constructs, such as functions, conditional statements, and loops, that make us amazingly productive. However, one disadvantage of writing code in a high-level language is the potentially significant decrease in performance. Ideally, you should write code that is understandable and maintainable, without compromising performance. For this reason, compilers attempt to optimize the code automatically to improve its performance, and they have become quite sophisticated at doing so. They can transform loops, conditional statements, and recursive functions; eliminate whole blocks of code; and take advantage of the target instruction set architecture (ISA) to make the code fast and compact. It is far better to focus on writing understandable code than to make manual optimizations that result in cryptic, hard-to-maintain code. In fact, manually optimizing the code might prevent the compiler from performing additional or more efficient optimizations.

Rather than manually tweaking the code, you should reconsider aspects of your design, such as using faster algorithms, incorporating thread-level parallelism, and using framework-specific features (such as move constructors).

This article is about Visual C++ compiler optimizations. I'm going to discuss the most important optimization techniques and the decisions a compiler has to make in order to apply them. My purpose is not to show you how to manually optimize the code, but to convince you why you can trust the compiler to optimize the code on your behalf. This article is by no means a complete examination of the optimizations performed by the Visual C++ compiler. However, it demonstrates the optimizations you really should know about and how to communicate with the compiler to apply them.

There are other important optimizations that are currently beyond the capabilities of any compiler, such as replacing an inefficient algorithm with an efficient one or changing the layout of a data structure to improve its locality. Such optimizations are beyond the scope of this article, however.

Define Compiler Optimizations

Optimization is the process of transforming a piece of code into another functionally equivalent piece of code in order to improve one or more of its characteristics. The two most important characteristics are the speed and size of the code. Other characteristics include the amount of energy required to execute the code, the time it takes to compile the code, and, in case the resulting code requires just-in-time (JIT) compilation, the time it takes to JIT compile the code.

Compilers are constantly improving the techniques they use to optimize code, but they are not perfect. Still, rather than spending time manually tweaking a program, it is usually far more fruitful to use specific features provided by the compiler and let the compiler optimize the code.

There are four ways you can help the compiler optimize your code more efficiently:

  1. Write code that is understandable and maintainable. Don't view the object-oriented features of Visual C++ as enemies of performance. The latest version of Visual C++ can keep such overhead to a minimum and sometimes eliminate it completely.
  2. Use compiler directives. For example, tell the compiler to use a function-calling convention that is faster than the default one.
  3. Use intrinsic functions. An intrinsic function is a special function whose implementation is provided automatically by the compiler. The compiler has intimate knowledge of the function and substitutes the function call with an extremely efficient instruction sequence that takes advantage of the target ISA. Currently, the Microsoft .NET Framework does not support intrinsic functions, so none of the managed languages support them. However, Visual C++ has extensive support for this feature (see the sketch after this list). Note that while using intrinsic functions can improve the performance of your code, it reduces its readability and portability.
  4. Use profile-guided optimization (PGO). With this technique, the compiler knows more about how the code is going to behave at run time and can optimize it accordingly.
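To illustrate the third point, here is a minimal sketch (not part of the article's sample code) that uses one of the Visual C++ intrinsics. _byteswap_ulong is declared in stdlib.h, and the compiler replaces the call with a single BSWAP instruction on x86/x86-64 rather than emitting an ordinary function call:

  // Minimal sketch of a Visual C++ intrinsic (not from the article's samples).
  #include <stdio.h>
  #include <stdlib.h>   // declares the _byteswap_ulong intrinsic

  int main(void) {
    unsigned long value = 0x12345678UL;
    // The compiler emits a single BSWAP instruction here instead of a call.
    printf("%08lX\n", _byteswap_ulong(value));   // prints 78563412
    return 0;
  }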

The purpose of this article is to show you why you can trust the compiler by demonstrating the optimizations performed on inefficient but understandable code (applying the first method). It also gives a brief introduction to profile-guided optimization and mentions some of the compiler directives that enable you to fine-tune parts of your code.

There are many compiler optimization techniques, ranging from simple transformations, such as constant folding, to extreme transformations, such as instruction scheduling. This article covers only some of the most important optimizations, those that can significantly improve performance (by a double-digit percentage) and reduce code size: function inlining, COMDAT optimizations, and loop optimizations. I'll discuss the first two in the next section and then show how you can control the optimizations performed by Visual C++. Finally, I'll take a brief look at optimizations in the .NET Framework. Throughout this article, I'll be using Visual Studio 2013 to build the code.

Code generation at link time

Link-Time Code Generation (LTCG) is a technique for performing whole program optimizations (WPO) on C/C++ code. The C/C++ compiler compiles each source file separately and produces the corresponding object file. This means the compiler can only apply optimizations to a single source file rather than to the whole program. However, some important optimizations can only be performed by looking at the whole program. You can apply these optimizations at link time rather than at compile time because the linker has a complete view of the program.

When LTCG is enabled (by specifying the /GL compiler switch), the compiler driver (cl.exe) invokes only the front end of the compiler (c1.dll or c1xx.dll) and postpones the work of the back end (c2.dll) until link time. The resulting object files contain C Intermediate Language (CIL) code rather than machine-dependent assembly code. Then, when the linker (link.exe) is invoked, it sees that the object files contain CIL code and invokes the back end of the compiler, which in turn performs WPO, generates the binary object files, and returns them to the linker. The linker then merges all the object files and produces the executable.

The front end actually performs some optimizations, such as constant folding, irrespective of whether optimizations are enabled or disabled. However, all important optimizations are performed by the back end of the compiler and can be controlled using compiler switches.

LTCG enables the back end to perform many optimizations aggressively (by specifying /GL together with the /O1 or /O2 and /Gw compiler switches and the /OPT:REF and /OPT:ICF linker switches). In this article, I'll only discuss function inlining and COMDAT optimizations. For a complete list of LTCG optimizations, refer to the documentation. Note that the linker can perform LTCG on native object files, mixed native/managed object files, pure managed object files, safe managed object files, and safe .netmodules.

I'll build a program consisting of two source files (source1.c and source2.c) and a header file (source2.h). The source1.c and source2.c files are shown in Figure 1 and Figure 2, respectively. The header file, which contains the prototypes of all the functions in source2.c, is quite simple, so I won't show it here.

Figure 1 - The file "source1.c"
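(The listing isn't reproduced in this copy of the article. The following is a reconstruction based on the description below; the code in the official download may differ in its details.)

  // source1.c (reconstruction)
  #include <stdio.h>     // scanf_s and printf
  #include "source2.h"

  int square(int x) { return x * x; }

  int main(void) {
    int n = 5, m;
    scanf_s("%d", &m);
    printf("The square of %d is %d.\n", n, square(n));  // n is known at compile time
    printf("The square of %d is %d.\n", m, square(m));  // m is not known at compile time
    printf("The cube of %d is %d.\n", n, cube(n));
    printf("The sum of 1..%d is %d.\n", n, sum(n));
    printf("The sum of the cubes of 1..%d is %d.\n", n, sumOfCubes(n));
    printf("The %dth prime number is %d.\n", n, getPrime(n));
    return 0;
  }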

Figure 2 - The file "source2.c"
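(Likewise a reconstruction based on the description below, not the original listing.)

  // source2.c (reconstruction)
  #include <math.h>      // sqrt
  #include "source2.h"

  int cube(int x) { return x * x * x; }

  int sum(int x) {
    int result = 0;
    int i;
    for (i = 1; i <= x; ++i) result += i;
    return result;
  }

  int sumOfCubes(int x) {
    int result = 0;
    int i;
    for (i = 1; i <= x; ++i) result += cube(i);
    return result;
  }

  // Returns a nonzero value if x is prime.
  int isPrime(int x) {
    int limit = (int)sqrt((double)x);  // sqrt is called once, before the loop
    int i;
    for (i = 2; i <= limit; ++i) {
      if (x % i == 0) return 0;
    }
    return 1;
  }

  // Returns the xth prime number.
  int getPrime(int x) {
    int count = 0, candidate = 1;
    while (count < x) {
      ++candidate;
      if (isPrime(candidate)) ++count;
    }
    return candidate;
  }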

The source1.c file contains two functions: the square function, which takes an integer and returns its square, and the main function of the program. The main function calls the square function and all the functions of source2.c except isPrime. The source2.c file contains five functions: the cube function returns the cube of a given integer, the sum function returns the sum of all integers from 1 to a given integer, the sumOfCubes function returns the sum of the cubes of all integers from 1 to a given integer, the isPrime function determines whether a given integer is prime, and the getPrime function returns the xth prime number. I've omitted error checking because it's of no interest in this article.

The code is simple but useful. There are several functions that perform simple computations, and some of them require loops. The getPrime function is the most complex because it contains a while loop, and within that loop it calls the isPrime function, which also contains a loop. I'll use this code to demonstrate one of the most important compiler optimizations, namely function inlining, along with some other optimizations.

I'll build the code under three different configurations and examine the results to determine how it was transformed by the compiler. If you want to follow along, you'll need the assembler output file (produced with the /FA[s] compiler switch) to examine the resulting assembly code, and the map file (produced with the /MAP linker switch) to determine the COMDAT optimizations that have been performed (the linker can also report this if you use the /verbose:icf and /verbose:ref switches). So make sure these switches are specified in all of the following configurations I discuss. Also, I'll be using the C compiler (/TC) so that the generated code is easier to examine. However, everything discussed here also applies to C++ code.

The debug configuration

The Debug configuration is used mainly because all the back-end optimizations are disabled when you specify the /Od compiler switch without specifying the /GL switch. When building the code under this configuration, the resulting object files contain binary code that exactly matches the source code. You can examine the resulting assembly output files and the map file to confirm this. This configuration is equivalent to the Debug configuration of Visual Studio.

The compile-time code generation release configuration

This configuration is similar to the Release configuration, in which optimizations are enabled (by specifying the /O1, /O2, or /Ox compiler switch) but without specifying the /GL compiler switch. Under this configuration, the resulting object files contain optimized binary code, but no whole-program optimizations are performed.

By examining the generated assembly listing file of source1.c, you'll notice that two optimizations have been performed. First, the first call to the square function, square(n), in Figure 1 has been completely eliminated by evaluating the computation at compile time. How did this happen? The compiler determined that the square function is small, so it should be inlined. After inlining it, the compiler determined that the value of the local variable n is known and does not change between the assignment statement and the function call. Therefore, it concluded that it is safe to execute the multiplication and substitute the result (25). In the second optimization, the second call to the square function, square(m), has been inlined as well. However, because the value of m is not known at compile time, the compiler cannot evaluate the computation, so the actual code is emitted.

Now examine the assembly listing file of source2.c, which is much more interesting. The call to the cube function in sumOfCubes has been inlined. This in turn has enabled the compiler to perform significant optimizations on the loop (as you'll see in the "Loop Optimizations" section). In addition, the SSE2 instruction set is being used in the isPrime function to convert from int to double when calling the sqrt function, and also to convert from double to int when returning from sqrt. And sqrt is called only once, before the loop starts. Note that if no /arch switch is specified, the x86 compiler uses SSE2 by default. Most deployed x86 processors, as well as all x86-64 processors, support SSE2.

The link-time code generation release configuration

The LTCG release configuration is equivalent to the Release configuration of Visual Studio. In this configuration, optimizations are enabled and the /GL compiler switch is specified. This switch must be specified explicitly along with /O1 or /O2. It tells the compiler to emit CIL object files rather than assembly object files. As described earlier, the linker then invokes the back end of the compiler to perform WPO. In the following, I'll describe several WPO optimizations to show the tremendous benefit of LTCG. The assembly code listings generated under this configuration are available online.

As long as function inlining is enabled (/Ob, which is turned on whenever you request optimizations), the /GL switch enables the compiler to inline functions defined in other translation units, irrespective of whether the /Gy compiler switch (discussed a bit later) is specified. The /LTCG linker switch is optional and provides hints only to the linker.

By examining the assembly listing file of source1.c, you can see that all function calls except scanf_s have been inlined. As a result, the compiler was able to execute the computations of cube, sum, and sumOfCubes. Only the isPrime function hasn't been inlined. However, if it were inlined manually in getPrime, the compiler would still inline getPrime in main.

As you can see, function inlining is important not only because it optimizes away a function call, but also because it enables the compiler to perform many other optimizations as a result. Inlining a function usually improves performance at the expense of increasing the code size. If this optimization is used excessively, a phenomenon known as code bloat results. At every call site, the compiler performs a cost/benefit analysis and then decides whether to inline the function.

Due to the importance of inlining, the Visual C++ compiler provides much more than the standard inline control. You can tell the compiler to never inline a range of functions by using the auto_inline pragma. You can tell the compiler to never inline a specific function by marking it with __declspec(noinline). You can mark a function with the inline keyword to give a hint to the compiler to inline it (although the compiler may ignore this hint if inlining would be a net loss). The inline keyword has been available since the first version of C++; it was introduced to C in C99. You can use the Microsoft-specific keyword __inline in both C and C++ code; it's useful when using an old version of C that doesn't support the inline keyword. Furthermore, you can use the __forceinline keyword (C and C++) to force the compiler to always inline a function whenever it can. And finally, you can tell the compiler to unroll a recursive function to a specific or indefinite depth by inlining it, using the inline_recursion pragma. Note that the compiler currently offers no capability that enables you to control inlining at the call site rather than at the function definition.
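Assuming a build with optimizations enabled (so that /Ob2 is in effect), a minimal sketch of several of these controls in C (not taken from the article's sample code) looks like this:

  #include <stdio.h>

  // Never inline this function, regardless of the cost/benefit analysis.
  __declspec(noinline) int add(int a, int b) { return a + b; }

  // Always inline this function whenever possible.
  __forceinline int mul(int a, int b) { return a * b; }

  // Allow recursive calls to be expanded inline, up to 8 levels deep
  // (inline_depth is a related pragma not discussed in this article).
  #pragma inline_recursion(on)
  #pragma inline_depth(8)
  int factorial(int n) { return n <= 1 ? 1 : n * factorial(n - 1); }

  int main(void) {
    printf("%d %d %d\n", add(2, 3), mul(2, 3), factorial(5));
    return 0;
  }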

The "/ Ob0" switch completely deactivates inline replacement and takes effect by default. You should use this switch when debugging (it is automatically specified in the Visual Studio debug configuration). The "/ Ob1" switch instructs the compiler to only consider functions for inline replacement that are marked with "inline", "__inline", or "__forceinline". The switch "/ Ob2", which takes effect when "/ O [1 | 2 | x]" is specified, instructs the compiler to consider any functions for inline replacement. In my opinion, the only reason to use the “inline” or “__inline” keywords is to control inline substitution with the “/ Ob1” switch.

Under certain circumstances, the compiler won't be able to inline a function. One example is when calling a virtual function virtually; the call can't be inlined because the compiler may not know which function is going to be called. Another example is calling a function through a pointer to the function rather than using its name. You should strive to avoid such circumstances to enable inlining. Refer to the MSDN documentation for a complete list of such conditions.

Function inlining isn't the only optimization that's more effective when applied at the whole-program level. In fact, most optimizations become more effective at that level. In the rest of this section, I'll discuss a specific class of such optimizations, called COMDAT optimizations.

By default, when compiling a translation unit, all code is stored in a single section in the resulting object file. The linker operates at the section level: it can remove sections, combine sections, and reorder sections. This prevents the linker from performing three optimizations that can significantly (by a double-digit percentage) reduce the size of the executable and improve its performance. The first is eliminating unreferenced functions and global variables. The second is folding identical functions and constant global variables. The third is reordering functions and global variables so that functions that fall on the same execution path and variables that are accessed together are physically located closer in memory to improve locality.

To enable these linker optimizations, you can tell the compiler to emit functions and variables into separate sections by specifying the /Gy (function-level linking) and /Gw (global data optimization) compiler switches, respectively. Such sections are called COMDATs. You can also mark a particular global data variable with __declspec(selectany) to tell the compiler to put the variable in a COMDAT. Then, by specifying the /OPT:REF linker switch, the linker eliminates unreferenced functions and global variables. And by specifying the /OPT:ICF linker switch, the linker folds identical functions and global constant variables. (ICF stands for Identical COMDAT Folding.) With the /ORDER linker switch, you can tell the linker to place COMDATs into the resulting image in a specific order. Note that all of these optimizations are linker optimizations and don't require the /GL compiler switch. The /OPT:REF and /OPT:ICF switches should be disabled while debugging, for obvious reasons.
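Putting these switches together, a build of this article's program that enables the COMDAT optimizations (without /GL) might look like the following sketch:

  cl /TC /O2 /Gy /Gw /FAs source1.c source2.c /link /OPT:REF /OPT:ICF /MAP /VERBOSE:REF /VERBOSE:ICF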

LTCG should always be used whenever possible. The only time to avoid it is when you intend to distribute the resulting object and library files. Recall that these files contain CIL code rather than assembly code. CIL code can be consumed only by a compiler/linker of the same version that produced it, which can significantly limit the usability of the object files, because developers would have to have that same compiler version to use them. In this case, unless you're willing to distribute the object files for every compiler version, you should use compile-time code generation instead. In addition to limited usability, these object files are many times larger than the corresponding assembler object files. However, do keep in mind the great benefit of CIL object files: enabling WPO.

Loop optimizations

The Visual C++ compiler supports several loop optimizations, but I'll discuss only three: loop unrolling, automatic vectorization, and loop-invariant code motion. If you modify the code in Figure 1 so that m is passed to sumOfCubes instead of n, the compiler won't be able to determine the value of the parameter, so it has to compile the function to handle any argument. The resulting function is highly optimized and also rather large, so the compiler won't inline it.

Compiling the code with the /O1 switch produces assembly code that is optimized for size. In this case, no optimizations are performed on the sumOfCubes function. Compiling with the /O2 switch produces code that is optimized for speed. The code will be considerably larger but also considerably faster, because the loop inside sumOfCubes has been unrolled and vectorized. It's important to understand that vectorization would not be possible without inlining the cube function. Moreover, loop unrolling wouldn't be as effective without inlining. Figure 3 shows a simplified graphical representation of the resulting assembly code. The flow chart is the same for both the x86 and x86-64 architectures.


Figure 3 - Control flow diagram for "sumOfCubes"

In Figure 3, the green diamond is the entry point and the red rectangles are the exit points. The blue diamonds represent conditions that are evaluated at run time as part of the sumOfCubes function. If SSE4 is supported by the processor and x is greater than or equal to eight, then SSE4 instructions are used to perform four multiplications at the same time. The process of executing the same operation on multiple values simultaneously is called vectorization. In addition, the compiler unrolls the loop by a factor of two; that is, the loop body is repeated twice in every iteration. The combined effect is that eight multiplications are performed per iteration. When x becomes less than eight, traditional instructions are used to execute the rest of the computations. Note that the compiler has emitted three exit points containing separate epilogues of the function rather than just one. This reduces the number of jumps.

Loop unrolling is the process of repeating the loop body within the loop so that more than one iteration of the original loop is executed within one iteration of the unrolled loop. The reason this improves performance is that the loop control instructions are executed less frequently. Perhaps more important, it may enable the compiler to perform many other optimizations, such as vectorization. The downside of unrolling is that it increases the code size and register pressure. However, depending on the loop body, it may improve performance by a double-digit percentage.
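To make the idea concrete, here is a hand-written sketch of what unrolling the sumOfCubes loop by a factor of two looks like (illustrative only, not the compiler's actual output; the function name is made up):

  int sumOfCubesUnrolled(int x) {
    int result = 0;
    int i = 1;
    // Two iterations of the original loop per iteration of this loop,
    // so the loop-control instructions execute half as often.
    for (; i + 1 <= x; i += 2) {
      result += i * i * i;
      result += (i + 1) * (i + 1) * (i + 1);
    }
    // Handle the leftover iteration when x is odd.
    for (; i <= x; ++i)
      result += i * i * i;
    return result;
  }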

Unlike x86 processors, all x86-64 processors support SSE2. Furthermore, you can take advantage of the AVX/AVX2 instruction sets of recent x86-64 microarchitectures from Intel and AMD by specifying the /arch switch. When /arch:AVX2 is specified, the compiler can also use the FMA and BMI instruction sets.

Currently, the Visual C++ compiler doesn't enable you to control loop unrolling. However, you can emulate this technique by using templates together with the __forceinline keyword. You can disable automatic vectorization on a specific loop using the loop pragma with the no_vector option.
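For example (the function here is made up for illustration; the pragma applies to the loop that immediately follows it):

  int sumOfSquares(int x) {
    int result = 0;
    int i;
  #pragma loop(no_vector)   // keep the auto-vectorizer away from this loop
    for (i = 1; i <= x; ++i)
      result += i * i;
    return result;
  }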

By looking closely at the generated assembly code, a keen eye would notice that the code can be optimized a bit further. However, the compiler has already done an excellent job, and it isn't worth spending more time analyzing the code and applying minor optimizations.

"SomeOfCubes" isn't the only function whose loop has been unrolled. If you change the code to pass “m” to the sum function instead of “n”, the compiler cannot evaluate the function and must therefore output its code. In this case the loop is unrolled twice.

The last optimization I'll discuss is loop-invariant code motion. Consider the following piece of code:
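
(The snippet isn't reproduced in this copy of the article. The following is a reconstruction based on the description below and may differ in its details from the original.)

  int sum(int x) {
    int sum = 0;
    int count = 0;
    int i;
    for (i = 1; i <= x; ++i) {
      sum += i;
      ++count;            // incremented in every iteration
    }
    printf("%d", count);  // requires stdio.h
    return sum;
  }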

The only change here is that there is an additional variable that is incremented in each iteration and then printed. It's easy to see that this code can be optimized by moving the increment of the count variable outside the loop: count can simply be assigned the value of x. This optimization is called loop-invariant code motion. The loop-invariant part clearly indicates that this technique only works when the code doesn't depend on any of the expressions in the loop header.

Now here's the catch: if you apply this optimization manually, the resulting code may exhibit degraded performance under certain conditions. Can you see why? Consider what happens when x is not positive. The loop never executes, which means that in the unoptimized version, the count variable is never touched. In the manually optimized version, however, an unnecessary assignment of x to count is executed outside the loop! Moreover, if x were negative, count would hold the wrong value. Both humans and compilers are susceptible to such pitfalls. Fortunately, the Visual C++ compiler is smart enough to realize this: it emits the condition of the loop before the assignment, resulting in improved performance for all values of x.
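Expressed in C, the compiler's safe version of the transformation looks roughly like this sketch:

  int sum(int x) {
    int sum = 0;
    int count = 0;
    int i;
    if (x >= 1) {     // the loop condition is emitted before the assignment
      for (i = 1; i <= x; ++i)
        sum += i;
      count = x;      // the hoisted, loop-invariant update
    }
    printf("%d", count);
    return sum;
  }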

In summary, if you are neither a compiler nor a compiler optimization expert, you should avoid manually transforming your code just to make it look faster. Keep your hands off and trust the compiler to optimize your code.

Controlling optimizations

In addition to the compiler switches /O1, /O2, and /Ox, you can control optimizations for specific functions using the optimize pragma, which looks like this:
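
  #pragma optimize( "[optimization-list]", {on|off} )

For example, to turn all of these optimizations off for a single function (the function name is just a placeholder):

  #pragma optimize( "", off )
  void myFunction(void) { /* ... */ }
  #pragma optimize( "", on )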

The optimization list can be empty or contain one or more of the following values: g, s, t, and y. These correspond to the compiler switches /Og, /Os, /Ot, and /Oy, respectively.

An empty list with the off parameter causes all of these optimizations to be turned off regardless of the compiler switches that have been specified. An empty list with the on parameter causes the specified compiler switches to take effect.

The switch "/ Og" enables global optimizations. These can be accomplished by just examining the function that is being optimized (not the functions that call it). When LTCG is enabled, “/ Og” enables WPO.

The optimize pragma is useful when you want different functions to be optimized in different ways: some for size and some for speed. However, if you really want to have that level of control, you should consider profile-guided optimization (PGO), which optimizes the code by using a profile containing behavioral information recorded while running an instrumented version of the code. The compiler uses the profile to make better decisions on how to optimize the code. Visual Studio provides the tools necessary to apply this technique to native and managed code.
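A typical PGO workflow for this program might look like the following sketch (switch names as documented for Visual Studio 2013; file names are illustrative):

  cl /O2 /GL source1.c source2.c /link /LTCG:PGINSTRUMENT
  source1.exe   (run the instrumented program on representative input to record a profile)
  cl /O2 /GL source1.c source2.c /link /LTCG:PGOPTIMIZE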

Optimizations in .NET

There is no linker in the .NET compilation model. However, there is a source code compiler (the C# compiler) and a JIT compiler. The source code compiler performs only minor optimizations. For example, it doesn't perform function inlining or loop optimizations. Instead, these optimizations are handled by the JIT compiler. The JIT compiler that ships with all versions of the .NET Framework up to 4.5 doesn't support SIMD instructions. However, the JIT compiler that ships with the .NET Framework 4.5.1 and later, called RyuJIT, supports SIMD.

What's the difference between RyuJIT and Visual C++ in terms of optimization capabilities? Because it does its work at run time, RyuJIT can perform optimizations that Visual C++ can't. For example, at run time RyuJIT might be able to determine that the condition of an if statement is never true in this particular run of the application and can therefore optimize it away. RyuJIT can also take advantage of the capabilities of the processor on which it's running. For example, if the processor supports SSE4.1, the JIT compiler will emit only SSE4.1 instructions for the sumOfCubes function, making the generated code more compact. However, it can't spend much time optimizing the code, because the time taken to JIT compile impacts the performance of the application. The Visual C++ compiler, on the other hand, can spend much more time spotting and exploiting other optimization opportunities. A great new technology from Microsoft, called .NET Native, enables you to compile managed code into self-contained executables optimized using the Visual C++ back end. Currently, this technology supports only Windows Store apps.

The ability to control managed code optimizations is currently limited. The C# and Visual Basic compilers only provide the ability to turn optimizations on or off using the /optimize switch. To control JIT optimizations, you can apply the System.Runtime.CompilerServices.MethodImpl attribute to a method, with an option from MethodImplOptions specified. The NoOptimization option turns off optimizations, the NoInlining option prevents the method from being inlined, and the AggressiveInlining (.NET 4.5) option gives a recommendation (more than just a hint) to the JIT compiler to inline the method.

Summary

All of the optimization techniques discussed in this article can significantly improve the performance of your code by a double-digit percentage, and all of them are supported by the Visual C++ compiler. What makes these techniques important is that, when applied, they enable the compiler to perform other optimizations. This article is by no means a comprehensive discussion of the compiler optimizations performed by Visual C++. However, I hope it has given you an appreciation of the capabilities of the compiler. Visual C++ can do more, much more, so stay tuned for Part 2.


Hadi Brais is a Ph.D. student at the Indian Institute of Technology Delhi (IITD), researching compiler optimizations for next-generation memory technology. He spends much of his time writing code in C/C++/C# and analyzing the CLR and CRT. You can find his blog at hadibrais.wordpress.com. Contact him at [email protected]

Thanks to the following Microsoft technical expert for reviewing this article: Jim Hogg