I spend quite a lot of time trying to write fast code, but it is something of an uphill battle with Microsoft's Visual Studio C++ compiler.

Here is an example of how MSVC struggles with member variable reads and writes in some cases.

Consider the following code. In the method writeToArray(), we write the value 'v' into the array pointed to by the member variable 'a'. The length of the array is stored in the member variable 'N'.

class CodeGenTestClass
{
public:
	CodeGenTestClass()
	{
		N = 1000000;
		a = new TestPair[N];
		v = 3;
	}

	__declspec(noinline) void writeToArray()
	{
		for(int i=0; i<N; ++i)
			a[i].first = v;
	}

	__declspec(noinline) void writeToArrayWithLocalVars()
	{
		TestPair* a_ = a; // Load into local var
		int v_ = v; // Load into local var
		const int N_ = N; // Load into local var

		for(int i=0; i<N_; ++i)
			a_[i].first = v_;
	}

	TestPair* a;
	int v;
	int N;
};

Compiler is Visual Studio 2015, x64 target with /O2.

The disassembly for writeToArray() looks like this:

		for(int i=0; i<N; ++i)
0000000140055350  xor         r8d,r8d  
0000000140055353  cmp         dword ptr [rcx+0Ch],r8d  
0000000140055357  jle         js::CodeGenTestClass::writeToArray+28h (0140055378h)  
0000000140055359  mov         r9d,r8d  
000000014005535C  nop         dword ptr [rax]  
			a[i].first = v;
0000000140055360  mov         rdx,qword ptr [rcx]       // Load this->a
0000000140055363  lea         r9,[r9+8]  
0000000140055367  mov         eax,dword ptr [rcx+8]     // Load this->v into eax
000000014005536A  inc         r8d  
000000014005536D  mov         dword ptr [r9+rdx-8],eax  // Store value in eax into array
0000000140055372  cmp         r8d,dword ptr [rcx+0Ch]   // Load this->N and compare with loop index.
0000000140055376  jl          js::CodeGenTestClass::writeToArray+10h (0140055360h)  
0000000140055378  ret  

The inner loop runs from address 0000000140055360 to 0000000140055376. I have added some comments.

Here rcx holds the 'this' pointer. What you can see is that inside the loop, the values of 'a', 'v', and 'N' are reloaded from memory on every iteration, which is wasteful.

Let's compare with the disassembly for writeToArrayWithLocalVars():

		TestPair* a_ = a; // Load into local var
		int v_ = v; // Load into local var
		const int N_ = N; // Load into local var

		for(int i=0; i<N_; ++i)
0000000140054AE0  movsxd      rdx,dword ptr [rcx+0Ch]  
0000000140054AE4  xor         eax,eax  
0000000140054AE6  mov         r8,qword ptr [rcx]  
0000000140054AE9  mov         r9d,dword ptr [rcx+8]  
0000000140054AED  test        rdx,rdx  
0000000140054AF0  jle         js::CodeGenTestClass::writeToArrayWithLocalVars+1Eh (0140054AFEh)  
			a_[i].first = v_;
0000000140054AF2  mov         dword ptr [r8+rax*8],r9d  // Store value 'v' (in r9d register) into the array
0000000140054AF6  inc         rax                       // increment loop index
0000000140054AF9  cmp         rax,rdx                   // Compare loop index with N
0000000140054AFC  jl          js::CodeGenTestClass::writeToArrayWithLocalVars+12h (0140054AF2h)   // branch
0000000140054AFE  ret  

Again, the inner loop is the last few instructions, from address 0000000140054AF2 to 0000000140054AFC, and I have added some comments.

As you can see, the member variables are not repeatedly loaded in the inner loop, but are instead stored in registers. This is much better, and executes faster:

test_class.writeToArray():              0.000541 s (1.84977 B writes/sec)
test_class.writeToArrayWithLocalVars(): 0.000380 s (2.63310 B writes/sec)
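
The timing code behind these numbers is not shown above, so here is a minimal sketch of how such a measurement could be taken. Everything in it is an assumption made for illustration: the TestPair layout, the helper names minTimeSeconds() and billionWritesPerSec(), and the choice of std::chrono::steady_clock are not taken from the original test code.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the TestPair type used in the post.
struct TestPair
{
	int first;
	int second;
};

// Runs 'f' num_trials times and returns the fastest wall-clock time in seconds.
// Taking the minimum reduces noise from whatever else the machine is doing.
template <class Func>
double minTimeSeconds(Func f, int num_trials)
{
	assert(num_trials > 0);
	double best = 1.0e30;
	for(int t=0; t<num_trials; ++t)
	{
		const auto start = std::chrono::steady_clock::now();
		f();
		const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
		if(elapsed.count() < best)
			best = elapsed.count();
	}
	return best;
}

// The same loop as writeToArrayWithLocalVars(), written as a free function for this sketch.
void writeFirst(TestPair* a, int N, int v)
{
	for(int i=0; i<N; ++i)
		a[i].first = v;
}

// Converts a write count and an elapsed time into billions of writes per second.
double billionWritesPerSec(int num_writes, double seconds)
{
	return num_writes / seconds * 1.0e-9;
}
```

To time a member function, pass a lambda that calls it, for example minTimeSeconds([&]() { test_class.writeToArray(); }, 10), then feed the result and N into billionWritesPerSec().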

Needless to say, Clang gets this right. Here's the inner loop for writeToArray() (see https://godbolt.org/g/juzpfV):

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
        mov     dword ptr [rdx + 8*rsi], ecx
        mov     dword ptr [rdx + 8*rsi + 8], ecx
        mov     dword ptr [rdx + 8*rsi + 16], ecx
        mov     dword ptr [rdx + 8*rsi + 24], ecx
        mov     dword ptr [rdx + 8*rsi + 32], ecx
        mov     dword ptr [rdx + 8*rsi + 40], ecx
        mov     dword ptr [rdx + 8*rsi + 48], ecx
        mov     dword ptr [rdx + 8*rsi + 56], ecx
        add     rsi, 8
        cmp     rsi, r8
        jl      .LBB1_4

Why this happens

It's hard to say for sure without seeing the source code for MSVC. But I think it's probably a failure of alias analysis.

Basically, a C++ compiler has to assume the worst: any pointer may point at anything else in the program's memory space, unless the compiler can prove that this is impossible under the rules of the language (e.g. because it would be undefined behaviour).

In this particular case, we have two pointers in play: the 'this' pointer and the 'a' pointer. Since we write through the 'a' pointer, it looks like MSVC is unable to determine that 'a' does not point to 'v', or 'this', or 'N'.
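
To see why the compiler cannot simply ignore this possibility, here is a small sketch of a case where the aliasing is real. The function and variable names are made up for illustration. Both pointers have type int*, so the language allows them to refer to the same object, and hoisting the load of the loop bound out of the loop would change the result.

```cpp
#include <cassert>

// Writes 'value' into out[0], out[1], ... while i < *count, and returns the number of writes.
// 'out' and 'count' both point to int, so a store to out[i] may legally change *count.
// The compiler therefore has to reload *count on every iteration.
int fillCounted(int* out, const int* count, int value)
{
	assert(out && count);
	int num_writes = 0;
	for(int i=0; i<*count; ++i)
	{
		out[i] = value;
		++num_writes;
	}
	return num_writes;
}

// Case 1: the count lives in its own variable, so the loop runs 4 times.
int fillSeparate()
{
	int buf[4] = {0, 0, 0, 0};
	const int n = 4;
	return fillCounted(buf, &n, 2);
}

// Case 2: the count is stored in buf[0], so the first write changes the loop bound from 4 to 2.
int fillAliased()
{
	int buf[4] = {4, 0, 0, 0};
	return fillCounted(buf, &buf[0], 2);
}
```

Here fillSeparate() returns 4 and fillAliased() returns 2. In the class above the situation is different, because the store goes to a member of a TestPair rather than to a free-standing int, which is where the types can help.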

To prove that a write through 'a' does not overwrite the value in 'N' or 'v', MSVC needs to do what is called alias analysis. I believe in this case it would be best done with type-based alias analysis (TBAA).

In C++ it is undefined behaviour to write to an object through a pointer of one type (in this case TestPair*) and read it through a pointer of another, unrelated type (CodeGenTestClass* for the this pointer?). The write through this->a therefore cannot store a value that is read from this->v or this->N. Unfortunately MSVC's TBAA is either absent or not strong enough to work this out.

(I may be wrong about the analysis pass required here. Compiler experts, please feel free to correct me!)

Moving the values into local variables, as in writeToArrayWithLocalVars(), allows the compiler to determine that they are not aliased. (It can determine this quite simply by noting that the addresses of the local variables are never taken, so no pointer can point at them.) This allows the values to be kept in registers.

One thing to note is that MSVC can do the aliasing analysis and produce a fast loop when the 'a' array is of a simple type such as int instead of TestPair. (Edit: Actually this is not the case; MSVC fails in this case too.)


This kind of code is pretty common in C++. I extracted this particular example from some hash table code I was writing.

You will see this kind of problem with MSVC whenever you are writing through a member pointer and reading from other member variables in the same loop (depending on the exact types etc.). So I would say this is a pretty serious performance/codegen problem for MSVC.
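
As a rough illustration of how the local-variable workaround carries over to hash-table-style code, here is a minimal sketch. It is not the hash table this example was extracted from. The class name, member names and the clear() method are all hypothetical.

```cpp
#include <cassert>

// Hypothetical open-addressing set of ints. All names here are illustrative.
class SketchHashSet
{
public:
	SketchHashSet(int num_buckets_, int empty_key_)
	:	buckets(new int[num_buckets_]), num_buckets(num_buckets_), empty_key(empty_key_)
	{
		clear();
	}
	~SketchHashSet() { delete[] buckets; }
	SketchHashSet(const SketchHashSet&) = delete;
	SketchHashSet& operator = (const SketchHashSet&) = delete;

	// Marks every bucket as empty.
	void clear()
	{
		// Copy the members into locals whose addresses are never taken,
		// so the stores in the loop cannot be assumed to modify them.
		int* const buckets_ = buckets;
		const int num_buckets_ = num_buckets;
		const int empty_key_ = empty_key;

		for(int i=0; i<num_buckets_; ++i)
			buckets_[i] = empty_key_;
	}

	// Returns the key stored in bucket i.
	int bucket(int i) const
	{
		assert(i >= 0 && i < num_buckets);
		return buckets[i];
	}

private:
	int* buckets;
	int num_buckets;
	int empty_key;
};
```

Note that the buckets here are plain ints, the same type as num_buckets and empty_key, so type information alone cannot rule out aliasing. In a case like this the local copies are the only way to tell the compiler that the loop bound and the fill value stay fixed.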

Edit: Comment thread on reddit. In this comment Gratian Lup clarifies that MSVC does not do TBAA.