Forward Scattering - The Weblog of Nicholas Chapman

I ran across this one today.

Microsoft's Visual Studio C++ compiler (MSVC) doesn't inline functions marked as inline, even when the function is only called once (has exactly one call site).

A little background on inliners: The inliner is the part of the compiler that inlines function calls. For every function call site (function/procedure invocation) it needs to decide whether to inline the function there, replacing the function call with the function body.

Inliners will often not inline a function if the function body is sufficiently large. Otherwise the code size will increase, resulting in slower compilation times, larger executables, and more pressure on the instruction cache.

This is all well and good, however, there should be an exception - if a function only has one call site, then the function should generally be inlined, especially if the call site is not in conditionally executed code.

Consider this program:

#include <cmath>

static inline float f(float x)
{
	// Do some random computations, to make this function sufficiently complicated
	const float y = std::sin(x)*x + std::cos(x);
	float z = 0;
	for(int z=0; z<(int)x; ++z)
		z += std::pow(x, y);
	z += y;
	return z;
}

int main(int argc, char** argv)
{
	return f(argc);
}

The body of function f will be executed regardless of whether f is inlined. The only difference is if the call instruction, function prologue and epilogue etc.. are executed. Obviously we want to avoid that if possible, therefore f should be inlined here.

The inliner should be smart enough to know that f is called exactly once, so there is no risk of code duplication. Note that f is marked as static, so the compiler doesn't have to worry about other translation units calling f.

Even with full optimisation and link-time code gen (/Ox and /LTCG) VS2015 does not inline the call to f.

Clang of course does a better job and inlines the call: godbolt.org/g/njG8aL.

VS2017 does inline f in the example above, but fails when f is made larger, so it looks like just the size threshold for inlining has changed.

This example is a toy program, but I ran across this issue in some performance-critical code I am writing, where a crucial function that only has a single call site is not being inlined.

Here are the perf results from force inlining this crucial function:

crucial function non-inlined 
---------------------
Build took 3.0740 s
Build took 3.0300 s

crucial function force-inlined:
------------------
Build took 2.8270 s	
Build took 2.8199 s

As you can see there is a significant performance boost from inlining the function.

Workaround

You can (of course) force-inline a function with __forceinline. However it would be nice if we didn't need to rely on workarounds.

Why does MSVC do this?

MSVC probably refuses to inline large functions that have just a single call-site, because it's possible that such functions are error-handling code, which are seldom-executed, for example:

res = doSomeCode();
if(res == SOME_UNLIKELY_ERROR_CODE)
  doSomeComplicatedErrorHandlingCode();

In this case we don't want to inline doSomeComplicatedErrorHandlingCode since it will just pollute the instruction cache with unexecuted code.

It's also possible that the MSVC guys just never got around to this optimisation :)

Regardless, MSVC could definitely use better heuristics to try and filter out this error handling case from other cases where it would be beneficial to inline (such as when the function is called unconditionally).

Conclusion

The MSVC inliner is missing an optimisation that can result in quite significant speedups - inlining a function when it is only called from one call site.

Also, don't believe anyone when they say the compiler is always smarter than you when it comes to deciding when to inline.

Issue filed with MS here, discussion on reddit here.