Compiler Optimizations - Native Profile-Guided Optimization of Code

September 2015

Volume 30 number 9

By Hadi Brais | September 2015

A compiler often makes poor optimization decisions that do not really improve the performance of the code or, in the worst case, degrade it. The optimizations described in the first two articles are critical to the performance of your application.

This article introduces an important technique called profile-guided optimization (PGO) that helps the compiler back end optimize code more effectively. In experiments, performance gains of 5 to 35 percent were achieved. In addition, if used carefully, this technique will never degrade the performance of your code.

This article builds on the first two parts (msdn.microsoft.com/magazine/dn904673 and msdn.microsoft.com/magazine/dn973015). If you are not familiar with the concept of PGO, I recommend that you first read the Visual C++ team blog post at bit.ly/1fJn1DI.

Introduction to profile-guided optimization

One of the most important optimizations a compiler performs is function inlining. By default, the Visual C++ compiler inlines a function as long as the caller doesn't grow too large. Many function calls get expanded this way, but that is only useful if the call is made frequently. Otherwise, it just increases code size, wasting space in the instruction and unified caches and enlarging the app's working set. But how does the compiler know whether the call is made frequently? It ultimately depends on the arguments passed to the function.
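To make the dependence on arguments concrete, here is a minimal sketch. All names in it (sum_values, clamp_and_report, g_slow_path_calls) are hypothetical and chosen only for illustration. The counter stands in for what a profile would record: how often a particular call site actually runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter standing in for what a profile would record:
// how often the call site inside sum_values() actually executes.
static std::size_t g_slow_path_calls = 0;

// Hypothetical helper; imagine it is large enough that inlining it
// noticeably grows the caller.
static int clamp_and_report(int value) {
    ++g_slow_path_calls;
    return value < 0 ? 0 : value;
}

// The call below only runs when 'validate' is true, so whether inlining
// pays off depends entirely on the arguments callers pass at run time.
int sum_values(const std::vector<int>& values, bool validate) {
    int total = 0;
    for (int v : values) {
        total += validate ? clamp_and_report(v) : v;
    }
    return total;
}
```

If every caller passes false for validate, the call site never executes and inlining the helper only adds code. The compiler cannot see that from the source alone.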

Most optimizations lack the reliable heuristics necessary to make correct decisions. I've seen many cases of bad register allocation that resulted in significant performance degradation. When you compile your code, all you can do is hope that the performance gains and losses from all optimizations add up to a positive overall result, that is, to faster code. This is almost always the case, but it can produce an excessively large executable file.

It would be nice to eliminate such side effects. If you could tell the compiler how the code behaves at run time, it could optimize the code better. The process of recording information about program behavior at run time is called profiling, and the information that is generated is called a profile. You can provide one or more profiles for the compiler to use to guide its optimizations. This is what profile-guided optimization (PGO) is all about.

You can use this technique in both native and managed code. However, the tools differ, so I'll only cover native PGO here and leave managed PGO for another article. The remainder of this section describes how to apply PGO to an application.

PGO is an excellent technique. But like everything else, it has a downside. It takes quite a bit of time (depending on the size of the app) and effort. Fortunately, as you'll see later, Microsoft provides tools that can substantially reduce the time it takes to apply PGO to an app. There are three phases to applying PGO to an application: the instrumentation build, the training, and the PGO build.

The instrumentation build

There are several methods of profiling a running program. The Visual C++ compiler uses static binary instrumentation, which generates the most accurate profiles but takes longer. Using instrumentation, the compiler inserts a small number of machine instructions at points of interest in all functions of your code (see Figure 1). These instructions record when the associated piece of code executes and add this information to the generated profile.


Figure 1: The instrumentation build of an app with profile-driven optimization

There are several steps to building an instrumented version of an app. First, compile all source code files with the "/GL" switch to enable Whole Program Optimization (WPO). WPO is needed to instrument the program (it isn't a strict technical requirement, but it makes the generated profile much more useful). Only files that were compiled with "/GL" are instrumented.

To make the next phase go as smoothly as possible, avoid compiler switches that generate additional code. For example, disable function inlining (/Ob0). Also disable security checks (/GS-) and remove runtime checks (no /RTC). This means you shouldn't use the default Release and Debug modes of Visual Studio. For files that were not compiled with "/GL", optimize for speed (/O2). For instrumented code, specify at least /Og.

Then link the generated object files and required static libraries with the "/LTCG:PGI" switch. This causes the linker to perform three tasks. First, it instructs the compiler back end to instrument the code. Second, it generates a PGO database (PGD) file, which is used in the third phase to store all the profiles. At this point the PGD file does not contain any profiles. It only contains information that identifies the object files used, so that any change to them can be detected when the PGD file is used. By default, the PGD file takes the name of the executable file. You can also use the optional linker switch "/PGD" to specify a PGD file name. The third task is to link the import library "pgort.lib". The output executable depends on the PGO runtime DLL "pgortXXX.dll", where XXX is the version of Visual Studio.

The result of this phase is an executable file (EXE or DLL) bloated with instrumentation code and an empty PGD file, which is filled in and used in the third phase. A static library can only be instrumented if it is linked into a project that is being instrumented. In addition, the same compiler version must produce all CIL OBJ files; otherwise the linker reports an error.

Profiling probes

Before moving on to the next phase, I want to discuss the code that the compiler inserts to profile your code. This lets you estimate the overhead added to your program and understand the information that is gathered at run time.

To record a profile, the compiler inserts a number of probes into every function compiled with "/GL". A probe is a short sequence of instructions (two to four) consisting of several push instructions followed by a call instruction to a probe handler. When necessary, a probe is wrapped in two function calls that save and restore all of the XMM registers. There are three types of probes:

  • Count probes: This is the most common type of probe. It counts the number of times a block of code is executed by incrementing a counter on each execution. It is also the cheapest probe in terms of size and speed. Each counter is 8 bytes on x64 and 4 bytes on x86.
  • Entry probes: The compiler inserts an entry probe at the beginning of every function. Its purpose is to tell the other probes in the same function to use the counters associated with that function. This is necessary because probe handlers are shared across functions. The entry probe of the "main" function initializes the PGO runtime. An entry probe is also a count probe. This is the slowest probe.
  • Value probes: These probes are inserted before every virtual function call and "switch" statement and are used to record a histogram of values. A value probe is also a count probe because it counts the number of times each value appears. This is the largest probe.
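The three probe types can be pictured with a small hand-instrumented function. This is a conceptual model only: real probes are a few machine instructions that call a shared probe handler in the PGO runtime, and every name below (FunctionCounters, g_classify_profile, classify) is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Conceptual model of the data the probes of one function would update.
struct FunctionCounters {
    std::uint64_t entry = 0;                        // entry probe (also a count probe)
    std::uint64_t block_hits[2] = {0, 0};           // count probes, one per basic block
    std::map<int, std::uint64_t> switch_histogram;  // value probe for one switch
};

static FunctionCounters g_classify_profile;

// A hand-instrumented function showing where each kind of probe would sit.
int classify(int code) {
    ++g_classify_profile.entry;                   // entry probe at function entry
    ++g_classify_profile.switch_histogram[code];  // value probe before the switch
    switch (code) {
    case 0:
        ++g_classify_profile.block_hits[0];       // count probe in this block
        return 10;
    default:
        ++g_classify_profile.block_hits[1];       // count probe in this block
        return 20;
    }
}
```

After a training run, the histogram shows which switch values dominate and the block counters show which blocks are hot, which is the kind of information the optimizer later consumes.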

A function is not instrumented with any probe if it has only one basic block (a sequence of instructions with one entry and one exit). In fact, it may even be inlined despite the "/Ob0" switch. Besides the value probe, each "switch" statement causes the compiler to create a constant COMDAT section that describes it. The size of this section is roughly the number of cases multiplied by the size of the variable that controls the switch.

Every probe ends with a call to the probe handler. The entry probe of the "main" function creates a vector of pointers to probe handlers (each pointer is 8 bytes on x64 and 4 bytes on x86), with each entry pointing to a different probe handler. In most cases, there are only a few probe handlers. Probes are inserted in each function at the following locations:

  • An entry probe at the entry of the function
  • A count probe in every basic block that ends with a call or a "ret" instruction
  • A value probe just before every "switch" statement
  • A value probe just before every virtual function call

The memory overhead of the instrumented program is determined by the number of probes, the number of cases in all "switch" statements, the number of "switch" statements and the number of virtual function calls.

All probe handlers increment a counter at some point to record the execution of the corresponding block of code. The compiler uses the ADD instruction to increment a 4-byte counter by 1 and, on x64, the ADC instruction to add the carry flag to the high 4 bytes of the counter. These instructions are not thread-safe, which means that no probe is thread-safe by default. If at least one of the functions can be executed by multiple threads at the same time, the results will not be reliable. In this case you can use the linker switch "/pogosafemode". This causes the compiler to prefix these instructions with LOCK, which makes all probes thread-safe. However, this also makes them slower. Unfortunately, this feature cannot be applied selectively.
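The effect of the LOCK prefix can be approximated in portable C++ with std::atomic. The sketch below is an analogy, not the PGO runtime's code, and its names are hypothetical. It only runs the atomic variant, because incrementing a plain integer from several threads would be a data race.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Thread-safe counter: fetch_add is an atomic read-modify-write. On x86/x64,
// compilers typically emit a LOCK-prefixed instruction for it.
static std::atomic<std::uint64_t> g_safe_counter{0};

// Increments the shared counter from several threads and returns the total.
// A plain "++counter" here would be a data race and could lose updates.
std::uint64_t count_from_threads(unsigned thread_count, std::uint64_t hits_per_thread) {
    g_safe_counter.store(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([hits_per_thread] {
            for (std::uint64_t i = 0; i < hits_per_thread; ++i) {
                g_safe_counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return g_safe_counter.load();
}
```

The atomic increment never loses an update, but every increment now contends for the same cache line, which is the slowdown the article attributes to thread-safe probes.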

If your application consists of multiple projects whose output is an EXE or DLL file to be optimized with PGO, you need to repeat the process for each one.

The training phase

After the first phase, you have an instrumented version of the executable and a PGD file. The second phase is training, in which the executable generates one or more profiles, each stored in a separate PGO Count (PGC) file. You use these files in the third phase to optimize the code.

This is the most important phase, because profile accuracy is critical to the success of the whole process. For a profile to be useful, it must reflect a common scenario in which the program is used. The compiler optimizes the program on the assumption that the exercised scenarios are common. If they are not, the program may perform worse in practice. A profile generated from a common usage scenario helps the compiler identify the hot paths to optimize for speed and the cold paths to optimize for size (see Figure 2).


Figure 2: The training phase when creating a PGO app

The complexity of this phase depends on the number of usage scenarios and the type of program. Training is easy when the application does not require user input. If there are many usage scenarios, sequentially generating a profile for each scenario may not be the fastest way to go.

In the complex training scenario in Figure 2, pgosweep.exe is a command-line tool that lets you control the contents of the profile that the PGO runtime maintains while the program runs. You can start several instances of the program and apply usage scenarios at the same time.

Imagine you have two instances running in processes X and Y. When scenario 1 is about to start in process X, call "pgosweep" and pass it the process ID and the "/onlyzero" switch. This causes the PGO runtime to clear the part of the in-memory profile for that process only. If you don't specify the process ID, the entire PGC profile is cleared. Then the scenario can start. You can trigger usage scenario 2 for process Y in a similar manner.

The PGC file is generated only when all running instances of the program have terminated. However, if the program has a long startup time and you don't want to restart it for every scenario, you can force the runtime to generate a profile and clear the in-memory profile to prepare for another scenario in the same run. To do this, run "pgosweep.exe" and pass it the process ID, the name of the executable file and the name of the PGC file.

By default, the PGC file is generated in the same directory as the executable file. You can change this using the VCPROFILE_PATH environment variable, which must be set before running the first instance of the program.

I discussed the data and instruction overhead of instrumented code earlier. In most cases, this overhead is manageable. By default, the memory usage of the PGO runtime does not exceed a certain threshold. If more memory turns out to be required, an error occurs. In this case you can use the environment variable VCPROFILE_ALLOC_SCALE to raise this threshold.

The PGO build

After you've run through all of the common usage scenarios, you will have a number of PGC files that you can use to create the optimized version of the program. You can discard PGC files that you do not want to use.

The first step in creating the PGO version is to merge all of the PGC files using a command-line utility called "pgomgr.exe". You can also use it to edit a PGD file. To merge the PGC files into the PGD file generated in the first phase, run "pgomgr" and pass it the "/merge" switch and the PGD file name. This merges all PGC files in the current directory whose names match the name of the specified PGD file, followed by "!#" and a number. The compiler and linker can use the resulting PGD file to optimize the code.

With the "pgomgr" tool you can give more weight to a more common or more important usage scenario. To do this, pass the corresponding PGC file name and the "/merge:n" switch, where "n" is a positive integer that specifies the number of copies of the PGC file to include in the PGD file. By default, "n" is 1. This multiplicity causes a particular profile to bias the optimizations in its favor.
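The weighting idea can be sketched as follows. This is a conceptual model under a loud assumption: the article does not describe the PGC or PGD file formats, so a profile is represented here simply as one execution count per code block, and merge_weighted is a hypothetical function, not part of any Microsoft tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed model: a profile is just one execution count per code block.
using Profile = std::vector<std::uint64_t>;

// Adds 'weight' copies of 'run' into 'database', mirroring the idea that
// "/merge:n" includes a PGC file n times.
void merge_weighted(Profile& database, const Profile& run, std::uint64_t weight) {
    assert(database.size() == run.size());
    for (std::size_t i = 0; i < run.size(); ++i) {
        database[i] += weight * run[i];
    }
}
```

Merging one run with weight 1 and another with weight 3 makes the blocks exercised by the second run look three times hotter, so size-versus-speed decisions tilt toward that scenario.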

The second step is to run the linker, passing in the same set of object files as in phase 1. This time, use the "/LTCG:PGO" switch. The linker looks in the current directory for a PGD file whose name matches the executable file. It ensures that the CIL OBJ files have not changed since the PGD file was generated in phase 1 and then passes them to the compiler to optimize the code. This process is shown in Figure 3. You can use the linker switch "/PGD" to explicitly specify a PGD file. Don't forget to enable function inlining for this phase.


Figure 3: The PGO build in phase 3

Most compiler and linker optimizations now become profile-guided. The result of this phase is an executable file that is highly optimized in terms of size and speed. This is a good time to measure the performance gains.

Manage the code base

If you make any changes to the input files passed to the linker with the "/LTCG:PGI" switch, the linker refuses to use the PGD file when "/LTCG:PGO" is specified. The reason is that such changes can significantly affect the usefulness of the PGD file.

One possibility is to repeat the three phases described above to create another compatible PGD file. However, if the changes were minor (such as adding a small number of functions, calling a function slightly less or more frequently, or adding a feature that is not used often), it is impractical to repeat the whole process. In this case you can use the "/LTCG:PGU" switch instead of the "/LTCG:PGO" switch. This instructs the linker to skip the compatibility checks for the PGD file.

These small changes accumulate over time. You will eventually reach a point where it is beneficial to re-instrument the application. You can determine when you have reached this point by looking at the compiler output of the PGO build, which reports how much of the code base the PGD file covers. If the profile coverage falls below 80 percent (see Figure 4), it is a good idea to re-instrument the code. However, this percentage is highly dependent on the type of application.
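The maintenance decision described in this section can be summarized in a few lines. The function and enum below are hypothetical helpers for illustration. Only the linker switches named in the comments and the 80 percent default come from the article, and the coverage figure is assumed to be read from the output of the most recent PGO build.

```cpp
#include <cassert>

// Linker modes named in the article; the enum itself is hypothetical.
enum class PgoLinkMode {
    Optimize,     // /LTCG:PGO: inputs unchanged, PGD file still compatible
    Update,       // /LTCG:PGU: minor changes, skip the compatibility checks
    Reinstrument  // start again with /LTCG:PGI and retrain
};

// Picks the next step from the state of the inputs and the last reported
// profile coverage. The threshold is application dependent.
PgoLinkMode choose_link_mode(bool inputs_changed, double coverage_percent,
                             double threshold_percent = 80.0) {
    assert(coverage_percent >= 0.0 && coverage_percent <= 100.0);
    if (coverage_percent < threshold_percent) return PgoLinkMode::Reinstrument;
    return inputs_changed ? PgoLinkMode::Update : PgoLinkMode::Optimize;
}
```

For example, a code base with minor edits and 90 percent coverage would keep using the existing PGD file through the update mode, while one that has drifted to 72.5 percent would be re-instrumented.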

PGO in action

PGO guides the optimizations employed by the compiler and linker. I'll use the NBody simulator to demonstrate some of its benefits. You can download this application from bit.ly/1gpEaCY. You will also need to download and install the DirectX SDK at bit.ly/1LQnKge to compile the application.

First, I compile the application in Release mode to compare it to the PGO version. To build the PGO version of the application, you can use the Profile Guided Optimization menu item in the Visual Studio Build menu.

You should also enable assembler output with the compiler switch "/FA[c]" (do not use "/FA[c]s" for this demo). With this simple application, it is sufficient to train the instrumented app once to generate a PGC file and use it to optimize the app. This gives you two executables: one blindly optimized and one optimized with PGO. Make sure you have access to the final PGD file, as you will need it later.

If you now run both executables one after the other and compare the achieved GFLOP values, you will see that they perform similarly. Apparently, applying PGO to the app was a waste of time. On closer inspection, however, the size of the app went from 531 KB (for the blindly optimized build) to 472 KB (for the PGO build), that is, a reduction of 11 percent. So applying PGO to this app shrank it while maintaining the same performance. Why is this so?

Take a look at the 200-line "DXUTParseCommandLine" function in the "DXUT/Core/DXUT.cpp" file. If you look at the generated assembly code of the Release build, you can see that the size of the binary code is approximately 2,700 bytes. In the PGO build, on the other hand, the binary code is no larger than 1,650 bytes. You can find the cause of this difference in the assembly instruction that checks the condition of the following loop:

The blindly optimized build generated the following code:

The PGO build, however, generated the following code:

Many users prefer to specify parameters through the graphical user interface rather than on the command line. Therefore, according to the profile information, the common scenario here is that the loop never iterates. Without a profile, the compiler cannot possibly know this, so it goes ahead and aggressively optimizes the code in the loop. In the process, many functions are inlined, which leads to pointless code bloat. In the PGO build, you provided the compiler with a profile that said the loop never ran. Because of this, the compiler knew there was no point in inlining functions that are called in the body of the loop.

Another interesting difference can be seen in the assembly code snippets. In the blindly optimized executable, the branch that is seldom executed is in the fall-through path of the conditional instruction. The branch that is almost always taken is 800 bytes away from the conditional instruction. This not only causes the processor's branch prediction to fail, it is also guaranteed to cause an instruction cache miss.

The PGO build avoided both of these problems by swapping the positions of the branches. In fact, the infrequently executed branch has been moved to a separate section of the executable, improving the locality of the working set. This optimization is known as dead code separation, and it would have been impossible without a profile. As with rarely called functions, such small differences in binary code can cause significant differences in performance.
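The layout that the PGO build arrives at can be pictured at the source level by separating the rarely executed work by hand. This is a hypothetical illustration, not the DXUT code: the Settings type, the option names and both functions are made up for this sketch, and an optimizing compiler remains free to lay out or inline the code differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical application settings with defaults for the common case.
struct Settings {
    bool fullscreen = false;
    int width = 640;
};

// Cold path: rarely runs, so it lives in its own function where its size
// does not dilute the hot path's instruction cache footprint.
static Settings parse_arguments(const std::vector<std::string>& args) {
    Settings s;
    for (const std::string& arg : args) {
        if (arg == "-fullscreen") {
            s.fullscreen = true;
        } else if (arg.rfind("-width:", 0) == 0) {  // starts with "-width:"
            s.width = std::stoi(arg.substr(7));
        }
    }
    return s;
}

// Hot path: the common case (no arguments) falls straight through.
Settings load_settings(const std::vector<std::string>& args) {
    if (args.empty()) return Settings{};
    return parse_arguments(args);
}
```

With a profile, the compiler can make this kind of split automatically and only where the training runs show that a block is cold.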

When creating the PGO build, the compiler reports how many of the instrumented functions were compiled for speed. It also shows this in the Visual Studio Output window. Typically, no more than 10 percent of the functions are compiled for speed (think aggressive inlining), while the rest are compiled for size (think partial or no inlining).

Let us consider a more interesting function, "DXUTStaticWndProc", which is defined in the same file. The control structure of the function looks like this:

The blindly optimized code emits each code block in the same order as in the source code. However, the code in the PGO build has been cleverly rearranged based on the frequency of execution of each block and the time each block was executed. The first two conditions were rarely executed, so the corresponding blocks of code are now in a separate section to improve cache and memory usage. In addition, the functions that were recognized as belonging to the hot path (e.g. DXUTIsWindowed) are now inlined:

Most optimizations benefit from a reliable profile, and others become possible only with one. Even if PGO does not lead to a significant increase in performance, it certainly reduces the size of the resulting executables and the overhead they place on the memory system.

PGO databases

The advantages of the PGD profile go far beyond guiding the compiler's optimizations. While you can use pgomgr.exe to merge multiple PGC files, the tool also serves a different purpose. It provides three switches that you can use to view the contents of the PGD file to better understand how your code behaves in the scenarios exercised. The first switch, /summary, instructs the tool to output a summary of the contents of the PGD file in text form. The second switch, /detail, used together with the first, instructs the tool to output a detailed description of the profile in text form. The last switch, /unique, tells the tool to display the function names in undecorated form (especially useful for C++ code bases).

Programmatic control

There is one other feature worth mentioning. The header file "pgobootrun.h" declares a function named "PgoAutoSweep". You can call this function to programmatically generate a PGC file and clear the in-memory profile in preparation for the next PGC file. The function accepts an argument of type "char *" which refers to the name of the PGC file. To use this feature, you need to link with the pgobootrun.lib static library. This is currently the only programmatic support related to PGO.
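A minimal sketch of how such calls might be placed between training scenarios follows. Several parts are assumptions: APP_PGO_INSTRUMENTED is a hypothetical macro you would define yourself for the instrumented build, run_scenario is a stand-in workload, and the labels are made up. Outside that build configuration the sweep is a no-op, so the code still compiles and runs without the PGO runtime.

```cpp
#include <cassert>
#include <string>
#include <vector>

#if defined(_MSC_VER) && defined(APP_PGO_INSTRUMENTED)
#include <pgobootrun.h>  // declares PgoAutoSweep; link with pgobootrun.lib
static void sweep_profile(const char* name) { PgoAutoSweep(name); }
#else
// No-op outside the instrumented build (assumption made for this sketch).
static void sweep_profile(const char*) {}
#endif

// Stand-in workload for one usage scenario (hypothetical).
static int run_scenario(int size) {
    int checksum = 0;
    for (int i = 0; i < size; ++i) checksum += i % 7;
    return checksum;
}

// Runs each scenario and sweeps the profile afterwards, so every scenario
// ends up in its own PGC file and the next one starts from a clean slate.
std::vector<std::string> train_all_scenarios() {
    std::vector<std::string> swept;
    const char* labels[] = {"small_input", "large_input"};
    const int sizes[] = {10, 100000};
    for (int i = 0; i < 2; ++i) {
        run_scenario(sizes[i]);
        sweep_profile(labels[i]);
        swept.push_back(labels[i]);
    }
    return swept;
}
```

Sweeping from inside the program avoids restarting a slow-starting application for every scenario, which is the same goal that pgosweep.exe serves from the command line.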

Summary

Profile-guided optimization (PGO) is an optimization technique that helps the compiler and linker make better optimization decisions by referring to a reliable profile whenever a tradeoff between size and speed is needed. Visual Studio provides visual access to this technique from the Build menu or the project context menu.

However, you get a richer set of features with the PGO plug-in, which you can download from bit.ly/1Ntg4Be. It is also well documented at bit.ly/1RLjPDi. If you recall the coverage threshold from Figure 4, the easiest way to apply it is with the plug-in, as described in the documentation. However, if you prefer to use command-line tools, see the article at bit.ly/1QYT5nO for numerous examples. If you have a native code base, it might be a good idea to give this a try. If you do, please let me know how it affected the size and speed of the application.


Figure 4: Maintenance cycle of the PGO code base

More resources

For more information on PGO databases, see Hadi Brais's blog post at bit.ly/1KBcffQ.


Hadi Brais is a PhD student at the Indian Institute of Technology Delhi (IITD). He researches compiler optimizations for next generation storage technology. He spends a large part of his time writing code in C/C++/C# and analyzing runtimes and compiler frameworks. You can find his blog at hadibrais.wordpress.com. Contact him at [email protected]

Thanks to the following Microsoft technical expert for reviewing this article: Ankit Asthana