Forward Scattering - The Weblog of Nicholas Chapman

The weblog of Nicholas Chapman

10 years of this blog

Posted 17 Oct 2023

The first post was made to this blog on the 28th of April 2013. How time flies!

Here's the source code for it: https://github.com/Ono-Sendai/blog

It's written in C++ :)

A poor entropy coding result in Zstandard

Posted 1 Oct 2023

I was looking into compression of heightmaps for Substrata, and tried doing lossless compression in the following way: quantise terrain heights to 16 bit values, then try and predict the next value (using some previous values, see this tweet) and then encode the difference between the prediction and the actual value (this is called the prediction residual).

You end up with a bunch of residuals, that you then want to compress. I tried using zstd, and it performed pretty poorly, giving a compressed size of 1406321 B, from 2097152 B of data (1024 * 1024 * 16 bits). The exact amount depended on compression level but didn't get much better)

Since I was curious, I decided to compute the entropy of the data. (See Entropy_(information_theory) on wikipedia )

$$H=\sum_x p(x) \log_2({1 \over p(x)})$$

This entropy measure strictly speaking only applies to infinite sequences of symbols, however in our case, with a reasonably large file of residuals, and assuming each residual is uncorrelated and identically distributed, the shannon entropy gives a lower bound on the number of bits needed to encode the data: the number of symbols times the entropy of the symbol distribution. Any correlation between symbols should only help the compressor.

A good 'entropy coder' should be able to get near this limit. There will be some overhead due to having to write the symbol encoding/table, but it shouldn't be that big in my case, as there were only around 2000 unique symbols (i.e. 2000 unique 16-bit values). You can compute the entropy in C++ code like so:

static double computeEntropy(const js::Vector<int16, 16>& filtered_data)
{
	std::map<int16, size_t> count;

	for(size_t i=0; i<filtered_data.size(); ++i)
	{
		count[filtered_data[i]]++;
	}

	double entropy = 0;
	for(auto it = count.begin(); it != count.end(); ++it)
	{
		const double p = (double)it->second / (double)filtered_data.size();

		entropy += -p * std::log2(p);
	}

	return entropy;
}

For my data this gives:

Entropy: 9.115353669970485 bits
Theoretical min compressed size: 1194767.6362303714 B

So the theoretical minimum is ~1.2 MB. Zstd is only acheiving ~1.4 MB, which is nowhere near the theoretical limit.

This was quite disappointing to me, as I like zstd and use it a lot, and I know that zstd has some fancy ANS entropy encoding.

The reason for this failure was explained to my by Conor Stokes as being due to the Lempel–Ziv–Welch stage of zstd compression, that is not well suited for entropy coding - and apparently this stage introduces a lot of overhead.

Interestingly, 7-zip compresses the data much better than zstd, reaching a size of 1,298,590 B.

So zstd is not the cure-all and panacea of data compression that I had hoped :)

I would also be interested in a good C++/C entropy encoding library if anyone knows one.

Edit: Comments on Hacker News here.

Introducing Substrata

Posted 25 May 2021

Substrata is my latest virtual world / metaverse.

I've written a post about my vision / plans for it here: https://substrata.info/about_substrata

No warning for unused variable when declared const

Posted 12 Feb 2020

An issue I found with the Visual Studio 2019 C++ Compiler: https://developercommunity.visualstudio.com/content/problem/785567/no-warning-for-unused-variable-when-declared-const.html

How to fix a bug

Posted 2 Oct 2019

Lets say you encounter a bug in the software you work on. Maybe a user informs you of the bug, or you find it yourself. Let's say this bug has been in your software for a little while, and is not a bug you just added in the last hour or so. This implies that your unit tests did not catch the bug.

Here's the process I recommend for fixing the bug:

1) Work out what the problem is.

2) Work out the solution, edit your source files until the bug is fixed.

3) Since the bug was not caught by a unit test, make a unit test that will catch the bug.

4) Undo the bug fix from your source files, and make sure your new unit test catches the bug.

5) Redo the bug fix. Make sure the new unit test passes.

6) Commit the bug fix and the new unit test(s).

7) Think about how the bug got into your source code. Why wasn't it caught by unit tests? How can you improve your unit tests?

8) Think about if there are likely to be any similar bugs in your source code. Can you add unit tests for those possible bugs?

This is not an exact prescription. The ordering of the steps is flexible, and in some cases some steps may be more aspirational than realistic.

CryptoVoxels world rendered in Indigo

Posted 8 May 2019

I did some virtual wandering through the CryptoVoxels world (loaded into Indigo Renderer), taking some virtual snapshots with Indigo:

Realtime fractal program

Posted 24 Feb 2019

I open-sourced this fun little program:

https://github.com/glaretechnologies/fractal

It animates a Julia set in realtime using OpenGL shaders. It has some funky colouring techniques so looks a bit nicer than your average fractal.

Fast rotation of a vector by a quaternion

Posted 18 Jul 2018

There is a fast way to compute the rotation of a vector by a quaternion, apparently noted first by Fabian Giesen. See this blog post. Since the original description/derivation seems to have been lost in the sands of time (pages 404), I thought I would re-derive it.

In the derivation below, I will follow the notation of Ken Shoemake in Quaternions.

Suppose we have a unit quaternion $$q = [\mathbf{v}, w]$$ The rotation of a vector, $\mathbf{p}$ by the quaternion $q$ is given by $$ p' = q p q^* $$ where p is treated as the quaternion $$ [(p_x, p_y, p_z), 0] = [\mathbf{p}, 0] $$ e.g. as a quaternion with the w coord zero.

Expanding out $q p q^*$ using the definition of quaternion multiplication gives $$ p' = q p q^* = ([\mathbf{v}, w][\mathbf{p}, 0])[-\mathbf{v}, w] $$ $$ p' = ([\mathbf{v}0 + \mathbf{p}w + \mathbf{v} \times \mathbf{p}, w 0 - \mathbf{v}.\mathbf{p}])[-\mathbf{v}, w] $$ $$ p' = ([\mathbf{p}w + \mathbf{v} \times \mathbf{p}, - \mathbf{v}.\mathbf{p}]) [-\mathbf{v}, w] $$ $$ p' = [(\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + (-\mathbf{v})(- \mathbf{v}.\mathbf{p}) + (\mathbf{p}w + \mathbf{v} \times \mathbf{p}) \times -\mathbf{v}), (- \mathbf{v}.\mathbf{p})w - (\mathbf{p}w + \mathbf{v} \times \mathbf{p}).(-\mathbf{v})] $$ Lets treat the w component of this first: $$ p'_w = - \mathbf{v}.\mathbf{p}w - (\mathbf{p}w).(-\mathbf{v}) + (\mathbf{v} \times \mathbf{p}).(-\mathbf{v}) $$ $$ p'_w = - w\mathbf{v}.\mathbf{p} + w\mathbf{p}.\mathbf{v} - (\mathbf{v} \times \mathbf{p}).\mathbf{v} $$ We know $\mathbf{v} \times \mathbf{p}$ is orthogonal to $ \mathbf{v} $, so $$(\mathbf{v} \times \mathbf{p}).\mathbf{v} = 0$$ Therefore $$ p'_w = 0 $$ Giving a resulting vector with zero w coordinate like we expected.

Continuing with the vector part of $p'$: $$ \mathbf{p'} = (\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + (-\mathbf{v})(- \mathbf{v}.\mathbf{p}) + (\mathbf{p}w \times -\mathbf{v}) + ((\mathbf{v} \times \mathbf{p}) \times -\mathbf{v}) $$ $$ \mathbf{p'} = (\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times \mathbf{p})w + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ $$ \mathbf{p'} = \mathbf{p}w^2 + (\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times \mathbf{p})w + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ $$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

Remember that for a unit quaternion, $\mathbf{v}$ is a (non-unit) vector around which the quaternion gives a rotation, in particular $$||\mathbf{v}|| = \sin(\Omega) $$ Where $ \Omega $ is half the angle of the quaternion rotation, e.g. $$||\mathbf{v}|| = \sin(\Omega) = \sin(\theta/2)$$ So now we decompose the vector $ \mathbf{p} $ into the vector projection and vector rejection (see wikipedia) components with respect to the unit vector $ \hat{\mathbf{v}} = \mathbf{v}/||\mathbf{v}||$, giving $$ \mathbf{p} = \hat{\mathbf{v}}(\hat{\mathbf{v}}.\mathbf{p}) + (-\hat{\mathbf{v}} \times (\hat{\mathbf{v}} \times \mathbf{p})) $$ Substituting in $$\hat{\mathbf{v}} = \mathbf{v}/||\mathbf{v}|| = { \mathbf{v} \over \sin(\theta/2) } $$ gives $$ \mathbf{p} = ({ \mathbf{v} \over \sin(\theta/2) })(({ \mathbf{v} \over \sin(\theta/2) }).\mathbf{p}) - (({ \mathbf{v} \over \sin(\theta/2) }) \times (({ \mathbf{v} \over \sin(\theta/2) }) \times \mathbf{p})) $$ $$ \mathbf{p} = { \mathbf{v}(\mathbf{v}.\mathbf{p}) \over \sin^2(\theta/2) } - { \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) \over \sin^2(\theta/2) } $$ $$ \mathbf{p} = { \mathbf{v}(\mathbf{v}.\mathbf{p}) - \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) \over \sin^2(\theta/2) } $$ or $$ \mathbf{v}(\mathbf{v}.\mathbf{p}) = \sin^2(\theta/2)\mathbf{p} + \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) $$ Going back to our expression for $\mathbf{p'}$: $$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ Substituting in our expression for $\mathbf{v}(\mathbf{v}.\mathbf{p})$ gives $$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + (\sin^2(\theta/2)\mathbf{p} + \mathbf{v} \times (\mathbf{v} \times \mathbf{p})) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ $$ \mathbf{p'} = \mathbf{p}(w^2 + \sin^2(\theta/2)) + 2(\mathbf{v} \times \mathbf{p})w + 2(\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ But for unit quaternions, the relationship between the w component and the rotation angle $\theta$ is given by $$w = \cos(\Omega) = \cos(\theta/2) $$ so $$ \mathbf{p'} = \mathbf{p}(\cos^2(\theta/2) + \sin^2(\theta/2)) + (\mathbf{v} \times \mathbf{p})w + \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) $$ And of course $$ \cos^2(\theta/2) + \sin^2(\theta/2) = 1 $$ so $$ \mathbf{p'} = \mathbf{p} + 2w(\mathbf{v} \times \mathbf{p}) + 2(\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$ which is the simple, fast expression that we are after.

You can also derive this from Rodrigues' rotation formula with some funky trig identities.

MSVC inliner and functions called just once

Posted 28 Nov 2017

I ran across this one today.

Microsoft's Visual Studio C++ compiler (MSVC) doesn't inline functions marked as inline, even when the function is only called once (has exactly one call site).

A little background on inliners: The inliner is the part of the compiler that inlines function calls. For every function call site (function/procedure invocation) it needs to decide whether to inline the function there, replacing the function call with the function body.

Inliners will often not inline a function if the function body is sufficiently large. Otherwise the code size will increase, resulting in slower compilation times, larger executables, and more pressure on the instruction cache.

This is all well and good, however, there should be an exception - if a function only has one call site, then the function should generally be inlined, especially if the call site is not in conditionally executed code.

Consider this program:

#include <cmath>

static inline float f(float x)
{
	// Do some random computations, to make this function sufficiently complicated
	const float y = std::sin(x)*x + std::cos(x);
	float z = 0;
	for(int z=0; z<(int)x; ++z)
		z += std::pow(x, y);
	z += y;
	return z;
}

int main(int argc, char** argv)
{
	return f(argc);
}

The body of function f will be executed regardless of whether f is inlined. The only difference is if the call instruction, function prologue and epilogue etc.. are executed. Obviously we want to avoid that if possible, therefore f should be inlined here.

The inliner should be smart enough to know that f is called exactly once, so there is no risk of code duplication. Note that f is marked as static, so the compiler doesn't have to worry about other translation units calling f.

Even with full optimisation and link-time code gen (/Ox and /LTCG) VS2015 does not inline the call to f.

Clang of course does a better job and inlines the call: godbolt.org/g/njG8aL.

VS2017 does inline f in the example above, but fails when f is made larger, so it looks like just the size threshold for inlining has changed.

This example is a toy program, but I ran across this issue in some performance-critical code I am writing, where a crucial function that only has a single call site is not being inlined.

Here are the perf results from force inlining this crucial function:

crucial function non-inlined 
---------------------
Build took 3.0740 s
Build took 3.0300 s

crucial function force-inlined:
------------------
Build took 2.8270 s	
Build took 2.8199 s

As you can see there is a significant performance boost from inlining the function.

Workaround

You can (of course) force-inline a function with __forceinline. However it would be nice if we didn't need to rely on workarounds.

Why does MSVC do this?

MSVC probably refuses to inline large functions that have just a single call-site, because it's possible that such functions are error-handling code, which are seldom-executed, for example:

res = doSomeCode();
if(res == SOME_UNLIKELY_ERROR_CODE)
  doSomeComplicatedErrorHandlingCode();

In this case we don't want to inline doSomeComplicatedErrorHandlingCode since it will just pollute the instruction cache with unexecuted code.

It's also possible that the MSVC guys just never got around to this optimisation :)

Regardless, MSVC could definitely use better heuristics to try and filter out this error handling case from other cases where it would be beneficial to inline (such as when the function is called unconditionally).

Conclusion

The MSVC inliner is missing an optimisation that can result in quite significant speedups - inlining a function when it is only called from one call site.

Also, don't believe anyone when they say the compiler is always smarter than you when it comes to deciding when to inline.

Issue filed with MS here, discussion on reddit here.

Poor Visual Studio code generation for reading from and writing to member variables

Posted 23 Jul 2017

I spend quite a lot of time trying to write fast code, but it is something of an uphill battle with Microsoft's Visual Studio C++ compiler.

Here is an example of how MSVC struggles with member variable reads and writes in some cases.

Consider the following code. In the method writeToArray(), we write the value 'v' into the array pointed to by the member variable 'a'. The length of the array is stored in the member variable 'N'.

class CodeGenTestClass
{
public:
	CodeGenTestClass()
	{
		N = 1000000;
		a = new TestPair[N];
		v = 3;
	}


	__declspec(noinline) void writeToArray()
	{
		for(int i=0; i<N; ++i)
			a[i].first = v;
	}

	__declspec(noinline) void writeToArrayWithLocalVars()
	{
		TestPair* a_ = a; // Load into local var
		int v_ = v; // Load into local var
		const int N_ = N; // Load into local var

		for(int i=0; i<N_; ++i)
			a_[i].first = v_;
	}

	TestPair* a;
	int v;
	int N;
};

Compiler is Visual Studio 2015, x64 target with /O2.

The disassembly for writeToArray() looks like this:

---------------------------------------------------------------------------
		for(int i=0; i<N; ++i)
0000000140055350  xor         r8d,r8d  
0000000140055353  cmp         dword ptr [rcx+0Ch],r8d  
0000000140055357  jle         js::CodeGenTestClass::writeToArray+28h (0140055378h)  
0000000140055359  mov         r9d,r8d  
000000014005535C  nop         dword ptr [rax]  
			a[i].first = v;
0000000140055360  mov         rdx,qword ptr [rcx]       // Load this->a
0000000140055363  lea         r9,[r9+8]  
0000000140055367  mov         eax,dword ptr [rcx+8]     // Load this->v into eax
000000014005536A  inc         r8d  
000000014005536D  mov         dword ptr [r9+rdx-8],eax  // Store value in eax into array
0000000140055372  cmp         r8d,dword ptr [rcx+0Ch]   // Load this->N and compare with loop index.
0000000140055376  jl          js::CodeGenTestClass::writeToArray+10h (0140055360h)  
	}
0000000140055378  ret  
---------------------------------------------------------------------------

I have bolded the inner loop and added some comments.

Rcx here is storing the 'this' pointer. What you can see is that inside the loop, the values of 'a', 'v', and 'N' are repeatedly loaded from memory, which is wasteful.

Let's compare with the disassembly for writeToArrayWithLocalVars():

---------------------------------------------------------------------------
		TestPair* a_ = a; // Load into local var
		int v_ = v; // Load into local var
		const int N_ = N; // Load into local var

		for(int i=0; i<N_; ++i)
0000000140054AE0  movsxd      rdx,dword ptr [rcx+0Ch]  
0000000140054AE4  xor         eax,eax  
0000000140054AE6  mov         r8,qword ptr [rcx]  
0000000140054AE9  mov         r9d,dword ptr [rcx+8]  
0000000140054AED  test        rdx,rdx  
0000000140054AF0  jle         js::CodeGenTestClass::writeToArrayWithLocalVars+1Eh (0140054AFEh)  
			a_[i].first = v_;
0000000140054AF2  mov         dword ptr [r8+rax*8],r9d  // Store value 'v' (in r9d register) into the array
0000000140054AF6  inc         rax                       // increment loop index
0000000140054AF9  cmp         rax,rdx                   // Compare loop index with N
0000000140054AFC  jl          js::CodeGenTestClass::writeToArrayWithLocalVars+12h (0140054AF2h)   // branch
	}
0000000140054AFE  ret  
---------------------------------------------------------------------------

Again I have bolded the inner loop and added some comments.

As you can see, the member variables are not repeatedly loaded in the inner loop, but are instead stored in registers. This is much better, and executes faster:

test_class.writeToArray():              0.000541 s (1.84977 B writes/sec)
test_class.writeToArrayWithLocalVars(): 0.000380 s (2.63310 B writes/sec)

Needless to say Clang gets this right, here's the inner loop for writeToArray(): (see https://godbolt.org/g/juzpfV)

---------------------------------------------------------------------------
.LBB1_4:                                # =>This Inner Loop Header: Depth=1
        mov     dword ptr [rdx + 8*rsi], ecx
        mov     dword ptr [rdx + 8*rsi + 8], ecx
        mov     dword ptr [rdx + 8*rsi + 16], ecx
        mov     dword ptr [rdx + 8*rsi + 24], ecx
        mov     dword ptr [rdx + 8*rsi + 32], ecx
        mov     dword ptr [rdx + 8*rsi + 40], ecx
        mov     dword ptr [rdx + 8*rsi + 48], ecx
        mov     dword ptr [rdx + 8*rsi + 56], ecx
        add     rsi, 8
        cmp     rsi, r8
        jl      .LBB1_4
---------------------------------------------------------------------------

Why this happens

It's hard to say for sure without seeing the source code for MSVC. But I think it's probably a failure of alias analysis.

Basically, a C++ compiler has to assume the worst, in particular it must assume that any pointer can be pointing at anything else in the program memory space, unless it can prove that it is not possible under the rules of the language (e.g. would be undefined behaviour).

In this particular case, we have two pointers in play - the 'this' pointer, and the 'a' pointer, and since we have a write through the 'a' pointer, it looks like MSVC is unable to determine that 'a' does not point to 'v', or 'this', or 'N'.

To be able to prove that a write through 'a' does not overwrite the value in N, or V, MSVC needs to be able to do what is called alias analysis. I believe in this case it would be best done with type-based alias analysis (TBAA).

Since in C++ it is undefined behaviour to write through a pointer with one type (in this case TestPair*), and read through a pointer with another type (CodeGenTestClass* for the this pointer?), therefore the write to this->a cannot store a value that is read from this->v or this->N. Unfortunately MSVC's TBAA is either absent or not strong enough to work this out.

(I may be wrong about the analysis pass required here, compiler experts please feel free to correct me!).

Moving the values into local variables as in writeToArrayWithLocalVars(), allows the compiler to determine that they are not aliasing. (It can determine this quite simply by noting that the address of the local variables is never taken, therefore no aliasing pointers can point at them) This allows the values to be placed into registers.

One thing to note is that MSVC can do the aliasing analysis and produce a fast loop when the 'a' array is of a simple type such as int instead of TestPair. (Edit: Actually this is not the case, MSVC fails in this case also)

Impact

This kind of code is pretty common in C++. I extracted this particular example from some hash table code I was writing.

You will see this kind of problem with MSVC whenever you are writing to and reading from member variables. (Depending on the exact types etc..). So I would say this is a pretty serious performance/codegen problem for MSVC.

Edit: Comment thread on reddit. In this comment Gratian Lup clarifies that MSVC does not do TBAA.

Poor code generation / missed optimisations for placement-new object construction in Visual Studio 2015

Posted 19 Jul 2017

See the bug report

Performance bugs - the dark matter of programming bugs

Posted 22 Mar 2017

There is a class of bugs lurking out there that I class as performance bugs. A performance bug is when the code computes the correct result, but runs slower than it should due to a programming mistake.

The nefarious thing about performance bugs is that the user may never know they are there - the program appears to work correctly, carrying out the correct operations, showing the right thing on the screen or printing the right text. It just does it a bit more slowly than it should have. It takes an experienced programmer, with a reasonably accurate mental model of the problem and the correct solution, to know how fast the operation should have been performed, and hence if the program is running slower than it should be.

I started documenting a few of the performance bugs I came across a few months ago, for example (on some platforms) the insert method of std::map is roughly 7 times slower than it should be, std::map::count() is about twice as slow as it should be, std::map::find() is 15% slower than it should be, aligned malloc is a lot slower than it should be in VS2015.

Types of performance bugs

There are a few types of performance bugs I come across quite often:

* Doing the same work repeatedly and redundantly - It's pretty common to mistakenly perform the same operation multiple times when a single time would have sufficed, for example zeroing the same memory multiple times. This is suprisingly common when you start looking out for it.

* Performing unnecessary work - For example, in Qt, if you call dataChanged() on a single item in a data model, it updates the view for just that item. However if you update 2 items, it updates the entire view rectangle, even if those items are right by each other (e.g. two columns of the same row). See the bug report.

* Choosing a poor algorithm, for example an $O(n^2)$ algorithm by accident. See Accidently Quadratic.

Defining a performance bug

I don't have a precise and bulletproof definition for a performance bug, but it would be something like:

a performance bug is one where the code is slower than the simple, textbook solution of the problem, due to performing redundant or unnecessary work, or choosing the wrong algorithm, or a logic error.

In some cases there will be a fine line between a performance bug, and plain old unoptimised code.

How to find performance bugs

* Use a profiler.

* Try single-stepping through your program in a debugger, making a note of what work is being done. Is work being repeated? John Carmack suggests this method:

"An exercise that I try to do every once in a while is to “step a frame” in the game, starting at some major point like common->Frame(), game->Frame(), or renderer->EndFrame(), and step into every function to try and walk the complete code coverage."

There's also an interesting quote from Carmack about a near performance bug on that same page:

"[...] when I was working on the Doom 3 BFG edition release, the exactly predicted off-by-one-frame-of-latency input sampling happened, and very nearly shipped. That was a cold-sweat moment for me: after all of my harping about latency and responsiveness, I almost shipped a title with a completely unnecessary frame of latency."

* Benchmark implementations of the same functions against each other. If one of the implementations is a lot slower, it may indicate a performance bug in the implementation.

If you do find a perf bug in third party code, make sure to make a bug report to the maintainer(s)! This way software gets faster for everyone.

Conclusion

Performance bugs are out there, lurking and unseen, like the dark matter that (possibly) pervades our universe. How many performance bugs are there in the wild? It's anyone's guess.

What has been your experience with performance bugs?

Discuss on Hacker News or Reddit.

Watch out - Qt headers suppress warnings with Visual Studio

Posted 2 Nov 2016

As I posted on the bug report on the Qt bug tracker:

Hi, We ran into this problem recently. In my opinion this is a somewhat serious problem with Qt that should be fixed.

We noticed this after fixing a bug in our code. A function was incorrectly taking an int argument instead of a double argument, and we were passing a double in. But the usual warning was not reported The usual warning would be something like:

warning C4244: 'initializing' : conversion from 'double' to 'int', possible loss of data

After a little experimentation I determined that the Qt headers are effectively suppressing warnings for all code after they are included, due to the use of #pragma warning in qglobal.h as noted in this bug report.

I don't think it's a good idea for Qt to disable warnings like this such that it effects user code. If Qt wasn't doing this we would have probably found our bug much earlier.

One possible solution is to use #pragma warning push at the start of qt headers, then #pragma warning pop at the end.

In my opinion, 'polluting' the compiler warning state in the way qt does is bad form, and is similar to 'using namespace std' in a header file, which we all know is a bad idea.

Although there is a QT_NO_WARNINGS option, I think it's a good idea to try fixing with the pragma pop idea above. Having an option is better than nothing, but Qt should not have dodgy behaviour by default.

EDIT: Discussion on Reddit

Functional Programming and Performance

Posted 20 Oct 2016

I gave a talk at the Functional Programming North East meetup yesterday, on Functional Programming and Performance.

The talk was on ways functional programming can allow/produce fast programs, covering the following topics:

* Garbage collection and reference counting, and how this can be made easier with functional programming

* Parallelism and functional programming

* Some weaknesses of functional programming performance

Slides are available in Open Office format here, or in PDF here.

Thanks to FP North East for having me!

Edit: Comments on reddit here.

Visualising the topology of the rotations with the help of Portals

Posted 10 Oct 2016

Summary: I present a way of visualising the topology of the space of rotations (SO(3)), using the analogy of portals from the computer game Portal. I use this analogy to explain Dirac's belt trick in a hopefully intuitive way.

The space of possible rotations can be thought of as a sphere, where every point in the sphere defines a rotation in the following way - the rotation is around a vector from the centre of the sphere to the point, in the counter-clockwise direction as usual, with the angle of rotation determined by the length of the vector.

This space of rotations is known as the group SO(3) - the group of special orthogonal 3x3 matrices.

(Note that in mathematics, rotations themselves don't say anything about paths from an initial to final state. Rather a rotation is more like an 'instantaneous' change in orientation.)

The radius of the sphere is $\pi$, so the point at the top of the sphere corresponds to a counter-clockwise rotation of $\pi$ radians (half a full rotation) around the vertical axis, and the point on the bottom of the sphere corresponds to a counter-clockwise rotation around the down vector. Since this vector is in the opposite direction to the vertical axis, the net result is the same rotation. This means that in the space of rotations these points are (topologically) identical. Likewise, any two points directly opposite each other on the boundary of the sphere are identical.

This can be visualised as a a portal connecting any two opposite points, like in the computer game Portal:

portal cover

In the computer game Portal, you can construct portals, a pair of which allow the player or objects to travel ('teleport') from portal to the other. In other words they connect space a bit like a wormhole does in general relativity and science fiction.

One portal is always orange, and this portal is connected to the other portal of the pair, which is always blue.

Now imagine a path through the space of rotations. Consider a path that goes from the bottom of the sphere to the top. This is a path from $-\pi$ to $\pi$ rotation around the upwards vertical axis. It is also a loop (closed path), since it ends up back at the same rotation. I like to think of this path as travelling through the portals.

This path is shown in the following diagram:

topology A path from $-\pi$ to $\pi$ rotation

We can also illustrate this path with a series of teapots. Each teapot is rotated by an amount given by stepping along the path a certain distance. The left teapot corresponds to the bottom of the path. The middle teapot corresponds to the middle of the path and just has the identity rotation. The right teapot corresponds to the top of the path. Note that the rotations 'wrap around', and it ends up with the same rotation as the left teapot.

Now the question arises - can we contract the loop to a single point? e.g. can we continuously deform the loop, so it gets smaller and smaller, until it reaches a single point?

It's clear using the portal analogy that this is not possible. If we try and move one of the portals leftwards around the sphere, to make it closer to the other portal, then the other portal moves rightwards, since the portals are constrained to be directly opposite each other! If you think about it, you will see that you can't contract a loop running through the portals to a point.

Now let's suppose that we have a path that crosses the space of rotations twice, e.g. a double rotation, or rotation from $-2\pi$ to $2\pi$.

topology A path from $-2\pi$ to $2pi$ rotation. Note that the path enters portal a and exits portal a', then enters portal b and then exits portal b'. The paths are shown slightly separated for clarity.

A path from $-2\pi$ to $2pi$ rotation

In the double rotation case, we can contract this loop to a single point!

To do so, we do the following trick: We rotate one pair of portals continuously around the sphere, until they match up with the other pair. Then the loop disappears to a point!

topology A deformation of a $4\pi$ rotation. A deformation of a $4\pi$ rotation. A deformation of a $4\pi$ rotation. A deformation of a $4\pi$ rotation.

This can be demonstrated in an animation. As time advances in the animation, we are moving the portals around the sphere as above, until the portals meet, and the path has contracted to a single point, at which there is no change in rotation along the path. Animation of a path through SO(3), starting with $-2\pi$ to $2\pi$ rotation, deformed continuously until the path vanishes.

This continuous deformation of a path from a $-2\pi$ to $2\pi$ rotation to a point, can be physically demonstrated by twisting a belt, or alternatively with the 'Balinese candle dance', both of which I encourage you to try! (See this video for an example.)

Proposed fix

The fix is to check the interior nodes keys for equality with the target key, and return from the traversal immediately in that case.


_Nodeptr _Lbound(const _Other& _Keyval) const
{	// find leftmost node not less than _Keyval
	_Nodeptr _Pnode = _Root();
	_Nodeptr _Wherenode = this->_Myhead();	// end() if search fails

	while(!this->_Isnil(_Pnode))
		if(_Compare(this->_Key(_Pnode), _Keyval))
			_Pnode = this->_Right(_Pnode);	// descend right subtree
		else
		{	
			// NEW: else node >= keyval
			if(!_Compare(_Keyval, this->_Key(_Pnode))) // NEW: if !(keyval < node):
				return _Pnode; // NEW: then keyval == node, so return it
				
			// _Pnode not less than _Keyval, remember it
			_Wherenode = _Pnode;
			_Pnode = this->_Left(_Pnode);	// descend left subtree
		}

	return (_Wherenode);	// return best remembered candidate
}

This doesn't change the asymptotic time complexity of the find call, it's still O(log(n)). But it does somewhat decrease the number of nodes that need to be traversed on average.

It does require additional key comparisons, which could result in a net slowdown in some cases, but the cost of unneeded memory accesses are likely to be higher than the cost of the additional comparisons in most cases, I believe, especially as the number of entries in the map increases and cache misses become more common.

Results with the proposed fix

I implemented the proposed fix above in a class copied from std::map, called modified_std::map.

Doing 1M lookups in a map and modified map with 2^16 unique int keys:

modified_std::map find: 0.045002 s, num_found: 1000000
std::map find:          0.052951 s, num_found: 1000000

The modified map find method is approximately 15% faster. This isn't a huge amount, but is worth optimising/fixing, for such a commonly-used operation!

There is a similar issue in _Eqrange.

std::map and std::unordered_map performance bug #2 - slow count

Posted 7 Aug 2016

How do you tell if an entry with a specific key has been inserted in a std::map and std::unordered_map? Historically the way was to call find and compare against end, e.g.

if(m.find(x) != m.end()) // If present:
	// do something

With the introduction of C++11, there is another technique - calling count.

if(m.count(x) != 0) // If present:
	// do something

Count returns the number of entries with the given key.

Note also that in std::map and std::unordered_map, count can only return 0 (in the case where the entry is not present) and 1 (when it is). Multiple entries with the same key are not allowed in a map or unordered_map.

You might think that both techniques for testing if an element with a given key is in a map would have similar performance, but this is not the case with some implementations.

Performance result for VS2015, Release build:

std::map count:           1.196503 s, keys_present: 10000000
std::map find:            0.612619 s, keys_present: 10000000
std::unordered_map count: 0.232709 s, keys_present: 10000000
std::unordered_map find:  0.134992 s, keys_present: 10000000

What we see is that count takes about twice the time that find does.

Why does count take so long? Stepping through the code in the VS debugger reveals the answer. Count calls equal_range, which computes the range of elements with the same key. equal_range effectively does two tree traversals, one for the start of the range, and one for the end. This is the reason why count takes approximately twice as long (at least for std::map).

I haven't looked into why std::unordered_map::count is slow but I assume it's for a similar reason.

Proposed solution

My proposed solution is to replace the implementation of std::map::count with something like the following:

size_t count(Keyval& k)
{
	return (find(k) == end()) ? 0 : 1;
}

Based on the test results above, this should make count approximately two times faster for std::map.

More results

Clang (clang version 3.6.0 (trunk 225608)) on Linux:

std::map count:           0.715613 s, keys_present: 10000000
std::map find:            0.708204 s, keys_present: 10000000
std::unordered_map count: 0.200249 s, keys_present: 10000000
std::unordered_map find:  0.110208 s, keys_present: 10000000

Only std::unordered_map::count seems to suffer from this problem, std::map::count is fine.

Xcode (Apple LLVM version 7.3.0 (clang-703.0.31)):

std::map count:           0.392112 s, keys_present: 10000000
std::map find:            0.413565 s, keys_present: 10000000
std::unordered_map count: 0.142392 s, keys_present: 10000000
std::unordered_map find:  0.159157 s, keys_present: 10000000

This std lib does not seem to have the problem.

Test source code: map_count_slowness.cpp.

Edit: discussion on reddit here.

An apparent performance bug with some std::map and std::unordered_map implementations

Posted 6 Aug 2016

A few days ago my colleague Yves mentioned a perplexing finding to me - It was significantly faster to call find on a std::unordered_map, and then insert a item into the map only if the find call indicates it is not present, than to just attempt to insert an item directly with an insert call.

This is strange, because the semantics of std::unordered_map::insert are that if an item with a matching key is already in the map, then the item is not updated, and insert returns a value indicating that it failed.

Calling find, then insert, is effectively doing the lookup twice in the case where the item is not in the map already. (and doing the lookup once when it is).

Calling insert directly should only be doing the lookup once, and hence should be of comparable speed or faster.

Here is some code demonstrating the problem:

const int N = 1000000; // Num insertions to attempt
const int num_unique_keys = 65536;

std::unordered_map<int, int> m;
for(int i=0; i<N; ++i)
{
	const int x = i % num_unique_keys;
	if(m.find(x) == m.end()) // If no such entry with given key:
		m.insert(std::make_pair(x, x));
}

Is significantly faster than

std::unordered_map<int, int> m;
for(int i=0; i<N; ++i)
{
	const int x = i % num_unique_keys;
	m.insert(std::make_pair(x, x));
}

Both code snippets result in a map with the same entries. Perf results:

std::unordered_map find then insert: 0.013670 s, final size: 65536
std::unordered_map just insert:      0.099995 s, final size: 65536
std::map find then insert:           0.061898 s, final size: 65536
std::map just insert:                0.131953 s, final size: 65536

for Visual Studio 2012, Release build.

Note that for std::unordered_map, 'just inserting' is about seven times slower than calling find then insert, when it should be comparable or faster.

std::map also appears to suffer from this problem.

Looking into this, it became apparent quickly that insert is slow, and especially that insert is slow even when it doesn't insert anything due to the key being in the map already.

Stepping through the code in the debugger reveals what seems to be the underlying problem - insert does a memory allocation for a new node (apparently for the bucket linked list) unconditionally, and then deallocates the node if it was not actually inserted! Memory allocations in C++ are slow and should be avoided unless needed.

This seems to be what is responsible for the insert code being roughly seven times slower than it should be.

The obvious solution is to allocate memory for the new node only when it has been determined that it is needed, e.g. only when a matching key is not already inserted in the map.

(There is a related issue as to why chaining is even used in std::unordered_map as opposed to open addressing with linear probing, but I will leave that for another blog post)

Performance of std::map and std::unordered_map can be critical in certain algorithms, so suffering from such massive unneeded slowdowns could be significantly slowing down a lot of code using std::unordered_map.

More results on other platforms

The problem is not fixed in VS2015:

std::unordered_map find then insert: 0.015878 s, final size: 65536
std::unordered_map just insert:      0.096205 s, final size: 65536
std::map find then insert:           0.060897 s, final size: 65536
std::map just insert:                0.128556 s, final size: 65536

Xcode on MacOS Sierra: (Apple LLVM version 7.3.0 (clang-703.0.31)):

std::unordered_map find then insert: 0.020444 s, final size: 65536
std::unordered_map just insert:      0.069502 s, final size: 65536
std::map find then insert:           0.053847 s, final size: 65536
std::map just insert:                0.094894 s, final size: 65536

So the Clang std lib on Sierra seems to suffer from the same problem.

Clang 3.6 I built on Linux: (clang version 3.6.0 (trunk 225608)):

std::unordered_map find then insert: 0.013862 s, final size: 65536
std::unordered_map just insert:      0.015043 s, final size: 65536
std::map find then insert:           0.076331 s, final size: 65536
std::map just insert:                0.092790 s, final size: 65536

The results for this Clang are much more reasonable.

I haven't tried GCC yet.

If you want to try yourself, source code for the tests is available here: map_insert_test.cpp. You will have to replace some code to get it to compile though.

Bug report filed with MS here.

Edit: Interesting discussion on reddit here and in the c++ subreddit here.

Special relativity, faster-than-light communication, and absolute time and space

Posted 23 Jul 2016

Summary: Faster-than-light (FTL) communication is traditionally thought to violate causality according to special relativity. However if there is an absolute space and time (absolute reference frame), causality violation is avoided. Instead observers can detect and measure their speed relative to the absolute reference frame.

The standard story

Faster-than-light (FTL) communication is thought by some to allow signalling back in time, thereby violating causality.

(It's worth noting that no FTL communication is actually known to be possible in the real world, however it's interesting from a metaphysical point of view, and also for science-fiction authors!)

It works like this: imagine you have two space stations floating in space, stationary relative to each other (s.s. a and s.s. b on illustration below). Then imagine you have a rocket that flies past the space stations at some velocity v.

Then imagine that you have some device that can transmit information faster than light; let's suppose that it can transmit with an infinite velocity. So space station a sends a FTL pulse to space station b. This is the horizontal green line from O to A in the illustration.

As space station b receives the FTL pulse, we've arranged it so that the rocket is at that exact moment flying right past space station b. So space station b communicates with the rocket. (Just with normal light is fine since they are basically at the same position)

Here's the tricky bit: Special relativity says that the rocket has a different coordinate frame than the space stations. In particular the x axis, which corresponds to all the points in space at the same rocket-time, is not the same as the x-axis for the space stations.

So the standard story is that the infinite velocity FTL commuication pulse, sent back from the rocket to the left (lower green arrow from A to B), will travel along the skewed x axis of the rocket-time.

Space station a then receives the pulse at B. Crucially, from the point of view of space station a, it has received the return communication before it sent it!

This is where the paradoxes come in. For example, what if the message said something like 'do not send the message', or 'kill this person, who is the grandfather of the message sender'. Then a contradiction arises. Much ink has been spilled (or keys mashed) trying to resolve this seeming contradiction.

v:
Interactive illustration - try changing the v value. A depiction of how FTL signalling violates causality in the standard special relativity story. Green lines = FTL communication. Blues lines = trajectory of space stations. Bold red line = trajectory of rocket.

With absolute space and time

With an absolute space and time, sending messages back in time is prohibited. Let us suppose that the space stations are at rest with respect to absolute space. Then the message from the rocket back to space station a does not point down in the space-time plot, but is instead just horizontal. Therefore no contradiction can arise.

v:
Interactive illustration - try changing the v value. A depiction of how causality violation with FTL signalling is avoided with absolute time. Green lines = FTL communication. Blues lines = trajectory of space stations. Bold red line = trajectory of rocket.

What happens instead of a contradiction, is that with FTL communication, you can determine your velocity relative to the absolute reference frame (absolute space).

Imagine a setup where you have two meter rulers, one pointing in the positive x direction, one in the negative x direction. You simultaneously send a FTL pulse to both ends of the rulers. There is a device at each end of the rulers that when it receives the FTL pulse, sends a light pulse back to you.

If you are at rest relative to the absolute reference frame, the light pulses will return to you at exactly the same time. However if you are at motion relative to the absolute reference frame, the light pulses will return at different times. In particular, you will receive the light pulse from a ruler pointing in the direction of travel earlier than the pulse from a ruler pointing away from the direction of travel. See the illustration below. An observer can calculate their speed relative to the absolute reference frame by measuring the difference in times between the return of the light pulses.

Physicists might refer to this as 'symmetry breaking' - the apparent equality of each reference frame is broken, and you can detect the absolute frame from the comfort of your laboratory.

v:
Interactive illustration - try changing the v value. A depiction of an observer measuring their speed relative to absolute space with FTL signalling. Green lines = FTL communication. Blues lines = light pulses. Red line = world line of observer moving relative to absolute reference frame.

Special relativity and absolute time and space

Special relativity does not eliminate the possibility of an absolute reference frame - rather it is an axiom of special relativity that it is impossible to detect the velocity of the observer relative to any possible absolute reference frame.

It is an additional metaphysical step to discard the undetectable absolute reference frame as non-existing.

To quote Matt Visser:

Warning: the "Einstein relativity principle" does not imply that there's no such thing as "absolute rest"; it does however imply that the hypothetical state of "absolute rest" is not experimentally detectable by any means. In other language, the "Einstein relativity principle" does not imply that the "aether" does not exist, it implies that the "aether" is undetectable. It is then a judgement, based on Occam's razor, to dispense with the aether as being physically irrelevant.

Some further food for thought regarding a potential absolute reference frame that not everyone is aware of - did you know that what is thought to be the afterglow of the Big Bang, called the Cosmic Microwave Background (CMB) is visible in all directions from Earth, and that scientists have measured the speed of the Earth relative to the CMB, which is possible due to the red-shift of the CMD spectrum in the direction of Earth's travel? Earth is travelling at about 0.002c (0.2% the speed of light) through the Universe relative to the CMB. Personally I regard the CMB as some kind of evidence for absolute space and time.

General relativity (the theory of gravity which explains gravity as curvature of space-time) also has profound things to say about space and time, but I have restricted myself to special relativity in this post.

Edit: Added some text in the 'With absolute space and time' section clarifying that the space station is stationary with respect to absolute space. Added a space-time diagram in the 'With absolute space and time' section to illustrate what I propose would happen with the FTL signals.

Things that suck about C++ #1

Posted 12 Apr 2016

I found a nasty bug today in some of my code. Let me explain.

First, pop quiz: What is the value of this C++ expression:

10.0f + true ? 1.0f : 2.0f;

If you answered 11.f, you're wrong. The actual answer is 1.0f.

The ternary conditional operator (?:) has very low precedence (See http://en.cppreference.com/w/cpp/language/operator_precedence, it's right down in box 15). The plus operator has higher precedence, so the expression gets evaluated as

(10.0f + true) ? 1.0f : 2.0f;

This code, which is a gross violation of any reasonable type system, compiles without warning! You can add 10.0f to a boolean value apparently and it's just fine! Of course the reason is that true gets converted to the integer 1, for historical C reasons presumably. And then the resulting value of 11.f effectively gets converted back to a boolean! (to evaluate the condition)

The thing that sucks about C++ in this case is the automatic conversion of true to 1, which allows 10.0f + true to pass the type checker. Automatic conversion/casting of types in a language is very dangerous. Each case should be closely considered by the designer(s). And in this case it's just a bad idea. Fuck you C++.

Edit: This is how our programming language Winter handles it:

AdditionExpression: Binary operator '+' not defined for types 'float' and 'bool'

buffer, line 1:
def main(real x) real : 10.0f + true ? 1.0f : 2.0f
                              ^

Much better.

Fast sorting on the CPU

Posted 17 Feb 2016

If you need to sort some data on the CPU, and you want to do it efficiently, you can do a lot better than std::sort.

For the uninitiated, std::sort is the sort function in the C++ standard library. It's a comparison-based sort, usually implemented as quicksort.

Recently I did some work on a parallel radix sort. A radix sort is not a comparison based sort. Instead it uses the actual values of 'digits' (chunks of bits) of the key to efficiently sort the items. As such radix sorts can run faster than comparison based sorts, both in terms of computational complexity and in practical terms.

Some results

The task is to sort N single-precision float values (32-bit floats). The floats have random values.

For the timings below N = 2^20 = 1048576. Best values from ~10 runs are given. Computer: i7-3770 (4 hardware cores), Platform: Windows 8.1 64-bit, compiler: Visual studio 2012, full optimisations + link time code generation.

std::sort:           0.08427573208609829 s  (12.442205769612862 M keys/s)
tbb::parallel_sort:  0.016449525188363623 s (63.745061817453646 M keys/s)
Radix sort:          0.013582046893134248 s (77.20309083383134  M keys/s)
Parallel radix sort: 0.006327968316327315 s (165.70500160288134 M keys/s)

Note that my parallel radix sort is roughly 13 times faster than std::sort!

Breaking down the speedup over std::sort, it looks something like this:

std::sort -> serial radix sort gives a ~6.2 times speedup.

serial radix sort -> parallel radix sort gives a ~2.1 times speedup.

This is way below a linear speedup over the serial radix sort, considering I am using 8 threads in the parallel sort. This likely indicates that memory bandwidth is the limiting factor.

Limitations of radix sort

Radix sort is not always the appropriate sorting choice. It has some memory and space overheads that make it a poor choice for small arrays. Once the number of elements gets smaller than 300 or so I have measured std::sort to be faster.

In addition, to be able to radix sort items, you need to be able to convert each key into a fixed length number of bits, where the ordering on the bit-key is the same as the desired ordering. Therefore you can't sort e.g. arbitrary length strings with radix sort. (Although you could probably use some kind of hybrid sort - radix sort first then finish with a quicksort or something like that)

Conclusion

Parallel radix sort is awesome!

If you are really desperate to sort fast on the CPU and want to use/licence the code, give me an email. Otherwise we'll probably open-source this parallel radix sorting code, once we've got Indigo 4 out the door.

Implementing a parallel radix is also quite fun and educational. I might write an article on how to do it at some point, but writing your own is fun as well!

Fixed length outputs

Note that this proof only applies for hash functions from strings of N bits to strings of N bits. Some cryptographic hash functions in the real world are defined like this, such as the Skein hash function. However, most cryptographic functions produce a fixed length output, such as SHA-256 that has a 256 bit output length.

Regardless, I think the same general principle applies. As far as I know, no hash function with a fixed length output has been proved to have the analogous properties as listed above. In other words, no hash function with fixed length output N has been proved to take exponential time on N to find an input that maps to the given output.

One-way functions

This property of hash functions we are thinking about is also called the 'one-way' function property. A hash function with this property is one-way in the following sense: we can compute the output $f(x)$, from the input $x$, easily and efficiently. However going in the 'other direction', e.g. computing the inverse, $f^{-1}(y)$, is not computable efficiently.

Remarkably we don't know if such one-way functions exist.

Implications

The implication of the above proof and comments is that no-one knows if cryptographic hash functions like the SHA hash functions really have the secure properties that people hope they do.

Many aspects of modern software depend on such hash functions, such as message authentication codes, code signing of executables, and even systems such as bitcoin, which uses the SHA-256 hash function in its proof-of-work scheme.

That these hash functions are not proved to be cryptographically secure is somewhat well known, but it's not that uncommon to hear people assuming that the security or properties of such systems have actually been proved.

This is quite similar to the situation with factoring of large numbers which is used in public key cryptography, which is suspected, or commonly treated as if it is not solvable in polynomial time, even though it has not been proved.

Sorting algorithms

The task of a sorting algorithm is to take a permutation of some original sorted list, and return the original sorted list.

Consider a comparison based sorting algorithm, that at each step can compare two elements together with the less than (<) operator. At each step it can eliminate up to half of the possible permutations. In other words, it can gather one bit of information with each comparison.

So to gather all the bits of information needed to determine which permutation it is looking at, it needs to make $\log_2(n!)$ comparisons.

It could do it less efficiently, but this is the lower bound on the number of comparisons which can handle the worst case.

(This is a very hand-wavy 'proof', but I hope you get the general idea)

Stirling's approximation

Stirling's approximation tells us that $$\ln(n!) \approx n \ln(n) - n$$ Here $\ln(x)$ is the natural logarithm function, e.g. $\log$ with base $e$. We can convert natural logarithms to log base 2 by dividing the log function by $\ln(2)$, which is about 0.6931, e.g. $$\log_2(x) = \ln(x) / \ln(2)$$ or $$\log_2(x) \times \ln(2) = \ln(x)$$ Using this rule in previous equation gives: $$\ln(n!) \approx n \ln(n) - n \\ \log_2(n!) \times 0.6931 \approx n \log_2(n) \times 0.6931 - n \\ \log_2(n!) \approx n \log_2(n) - n / 0.6931 $$ When you are doing asymptotic complexity analysis, constants don't matter.

So the optimal asymptotic complexity of the comparison based sort algorithm is $n \log_2(n)$, which is the same asymptotic complexity as $n \ln(n)$.

Let's look at some actual data points. For $n = 10$:

n: 10
n!: 3628800
log_2(n!): 21.791061114716953
n*log_2(n): 33.219280948873624
Stirling's approximation (n log_2(n) - n / log(2)): 18.79233053998399

For $n = 32$:

n: 32
n!: 2.631308369336935e35
log_2(n!): 117.66326365236031
n*log_2(n): 160
Stirling's approximation (n log_2(n) - n / log(2)): 113.83375869155317

For $n = 150$:

n: 150
n!: 5.7133839564458505e262
log_2(n!): 872.8595063470792
n*log_2(n): 1084.322803574382
Stirling's approximation (n log_2(n) - n / log(2)): 867.9185474410376

One interesting thing to note is that n*log_2(n) is quite a poor estimate of the number of comparisons needed (it is quite a lot too large), at least for these relatively small n values. the (ceiling'd) log_2(n!) values give the actual number of comparisons needed.

So there you go, some curious things about permutations and sorting. I hope some people found this interesting. If there are any bits of maths that you don't understand, let me know and I will go into more depth. Corrections welcome of course.

Slow pscp upload speeds

Posted 27 Jan 2016

I sent this email to the Putty guys, but I haven't heard back from them. So I'll post it here.

---------------------------

First of all, thanks for your work on Putty. It's a very useful tool for us.

However pscp uploads much more slowly than it should for us. We are uploading from our office network (UK) to a server in a German data centre. Ping latency to the server is ~31ms.

Uploading a ~100MB file, putty stabilises at around 220kB/s: IndigoRenderer_x64_4.0.14 | 100580 kB | 218.2 kB/s | ETA: 00:00:43 | 91%

The computer it is running on is running Windows server 2008, on an i7-2600, with 16GB RAM.

In comparison, scp running on a Linux machine on the same LAN, uploading to the same server, gets around 8MB/s upload speed.

speedtest.net shows a ~60Mb/s upload speed from our office.

This indicates to me that the pscp upload speeds are a lot slower than they could be. Is this a problem you are aware of? Are there any plans to improve the upload speed?

Finding annoying programs that steal window focus

Posted 22 Nov 2015

Such programs are super annoying. I had this problem on my computer - some program is occasionally stealing window focus, resulting in being switched out of full-screen games at inopportune times etc..

So I made a program to detect such annoying programs - https://github.com/glaretechnologies/focusmonitor.

I figured out the offending program is C:\Users\nick\AppData\Local\Akamai\installer_no_upload_silent.exe.

Dear Akamai: please fix your annoying program to not steal window focus. Thank you.

I have looked at this program that presumably does the same thing: http://www.happydroid.com/focus, but Windows Defender said it had a virus in it. Not sure if that is true, but I'm hesitant to run unsigned executables from random sources anyway. So I made my own.

False Sharing Curiosities

Posted 18 Oct 2015

I ran into an interesting effect this weekend while working on some parallel counting code.

After a little experimentation I determined that it is likely to be a false sharing effect.

False sharing is where the same cache line is read/modified by multiple different threads in a loop, and can cause major performance degradation.

The counting code loop looked like this:

for(size_t i=0; i<num; ++i)
	counts[bucket_chooser(in[i])]++;

The counts buffer for each thread has 8 x 8 byte elements. Each thread sees a range of the overall count buffer.

The odd thing is, is that the per-thread count table is 64 bytes already, which is the cache-line size (unit of cachable memory) of modern CPUs. This means that the count tables for each thread are on different cache lines.

Edit: I was careful to align the per-thread count table to 64 bytes as well.

However when I added a large amount of padding between the count tables, the performance shot up.

I made a graph of padding between each 64-byte table:

The peaks are at 0 B, 32 B, 64 B, 96 B, 128 B etc..

So it seems that to avoid false sharing effects, you need quite a lot of padding between data - something like 128 bytes, much more than the zero bytes indicated in almost all online sources.

In other words, it's not enough to just separate the cache lines that each thread works on, they must be separated by several cache-line widths.

Has anyone else seen or know about similar behaviour? CPUs are very complicated so it's possible I may be seeing some other effect, or there may be a bug in my code.

I saw a comment somewhere that the effect may be due to prefetching of adjacent cache-lines by the CPU, this may be an explanation.

Edit: Here's some C++ source code to demonstrate this effect. This is not the source used to generate the above graph, but is similar to it, and uses C++11 threading, timing etc..


#include <chrono>
#include <thread>
#include <iostream>
#include <random>


void threadFunc(const int* data_, size_t num, size_t* local_counts_)
{
	const int* const data = data_;
	size_t* const local_counts = local_counts_;

	for(size_t i=0; i<num; ++i)
		local_counts[data[i]]++;
}


void testWithPadding(const std::vector<int>& data, size_t padding_elems)
{
	auto start_time = std::chrono::high_resolution_clock::now();

	static_assert(sizeof(size_t) == 8, "sizeof(size_t) == 8");

	const int num_threads = 4;

	// Allocate memory for counts, align to 128-byte boundaries.
	size_t* counts = (size_t*)_mm_malloc(num_threads * (8 + padding_elems) * sizeof(size_t), 128);
	std::memset(counts, 0, num_threads * (8 + padding_elems) * sizeof(size_t)); // zero counts
	
	std::thread threads[num_threads];
	for(int i=0; i<num_threads; ++i)
		threads[i] = std::thread(threadFunc, /*data=*/data.data() + (data.size()/num_threads)*i, 
			/*num=*/data.size()/num_threads, /*local_counts=*/counts + (8 + padding_elems)*i);

	for(int i=0; i<num_threads; ++i)
		threads[i].join();

	auto end_time = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double> diff = end_time - start_time;

	std::cout << "padding: " << padding_elems*sizeof(size_t) << " B, elapsed: " << diff.count() << " s" << std::endl;

	_mm_free(counts);
}


void falseSharingTest()
{
	// Generate random data to count
	std::default_random_engine generator;
	std::uniform_int_distribution<int> distribution(0, 7);
	const size_t DATA_SIZE = 1 << 26;
	std::vector<int> data(DATA_SIZE);
	for(size_t i=0; i<DATA_SIZE; ++i)
		data[i] = distribution(generator);

	// Test performance with various padding sizes
	for(int i=0; i<32; ++i)
		testWithPadding(data, i);
}

Resulting output:

padding: 0 B, elapsed: 0.0312501 s
padding: 8 B, elapsed: 0.0781306 s
padding: 16 B, elapsed: 0.0625087 s
padding: 24 B, elapsed: 0.062502 s
padding: 32 B, elapsed: 0.0312535 s
padding: 40 B, elapsed: 0.0625048 s
padding: 48 B, elapsed: 0.0312495 s
padding: 56 B, elapsed: 0.0312519 s
padding: 64 B, elapsed: 0.015628 s
padding: 72 B, elapsed: 0.0312498 s
padding: 80 B, elapsed: 0.0156279 s
padding: 88 B, elapsed: 0.0312538 s
padding: 96 B, elapsed: 0.0156246 s
padding: 104 B, elapsed: 0.0156267 s
padding: 112 B, elapsed: 0.0156267 s
padding: 120 B, elapsed: 0.0156265 s
padding: 128 B, elapsed: 0.0156309 s
padding: 136 B, elapsed: 0.0156219 s
padding: 144 B, elapsed: 0.0156246 s
padding: 152 B, elapsed: 0.0312559 s
padding: 160 B, elapsed: 0.0156231 s
padding: 168 B, elapsed: 0.015627 s
padding: 176 B, elapsed: 0.0156279 s
padding: 184 B, elapsed: 0.0156244 s
padding: 192 B, elapsed: 0.0156258 s
padding: 200 B, elapsed: 0.0156261 s
padding: 208 B, elapsed: 0.0312522 s
padding: 216 B, elapsed: 0.0156265 s
padding: 224 B, elapsed: 0.0156264 s
padding: 232 B, elapsed: 0.0156267 s
padding: 240 B, elapsed: 0.0156249 s
padding: 248 B, elapsed: 0.0156276 s

It looks like VS2012's 'high resolution' (std::chrono::high_resolution_clock) clock is not actually high resolution, so feel free to replace the timing code with a proper high-resolution clock. I used our QueryPerformanceCounter-based timer in the original code.

Edit: A little more detail on the prefetching explanation - from the Intel Optimisation manual:

3.7.3 Hardware prefetching for Second-Level Cache

"Streamer - Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line."

Maybe my CPU (i7 Ivy Bridge) is fetching both the cache lines to the left and to the right of the actually used cache line. This would explain the 128-byte padding seemingly needed for optimum performance.

Edit: reddit discussion here.

Indirection isn't slow, unpredictable indirection is

Posted 10 Oct 2015

Indirection is an extremely common programming technique. In fact, it's well known that all problems in computer science can be solved with an extra layer of indirection.

Accessing data through a layer of indirection has some performance impact. The most important aspect of the performance impact however, is if the indirection is predictable. If the ultimately accessed data is accessed in a coherent order, even through a layer of indirection, performance is still pretty good.

Some results of the code below, which tests 2^26 memory reads done in various ways:

Data size: 256.000 MB
no indirection elapsed:        0.02169 s (12.38 GiB/s, 1.196 cycles/read)
sum: 4261412864
predictable indices elapsed:   0.03996 s (6.718 GiB/s, 2.203 cycles/read)
sum: 4261412864
unpredictable indices elapsed: 0.9277 s (0.2894 GiB/s, 51.15 cycles/read)
sum: 4261412864

As you can see, adding a layer of indirection has some performance impact (~2.2 cycles per read instead of ~1.2 cycles.). However, as long as the accessed data is accessed in a predictable, coherent fashion, then the cache and prefetcher on the CPU means the slowdown is not too bad.

If, however, the order of access is not coherent and predictable, performance takes a nose-dive, as the CPU cache and prefetcher cannot effectively fetch memory before it is needed or efficiently when it is need, and massive latency is incurred.

This kind of performance degradation is what I expect from unpredictable and incoherent memory access, with or without a layer of indirection.

The key point is that it's not indirection per se that is slow, but unpredictable and incoherent memory access patterns that are commonly seen with indirection.

const int N = 1 << 26; // num reads
const double cycles_per_sec = 3.7e9; // For my i7 Ivy bridge, running in 'turbo-boost' mode.

// Allocate some data, fill with semi-random values.
std::vector<uint32> data(N);
for(int i=0; i<N; ++i)
	data[i] = i % 128;

conPrint("Data size: " + getNiceByteSize(N * sizeof(uint32)));

// Test data access without indirection
{
	Timer timer;
	uint64 sum = 0;
	for(int i=0; i<N; ++i)
	{
		sum += data[i];
	}

	double elapsed = timer.elapsed();
	const double bytes_per_sec = 4 * N / elapsed;
	const double cycles_per_read = (elapsed * cycles_per_sec) / N; // num cycles = elapsed * cycles-per-sec
	conPrint("no indirection elapsed:        " + doubleToStringNSigFigs(elapsed, 4) + " s (" + doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " + doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
	printVar(sum);
}

// Test indirection with predictable indices
{
	std::vector<uint32> indices(N);
	for(int i=0; i<N; ++i)
		indices[i] = i;

	Timer timer;
	uint64 sum = 0;
	for(int i=0; i<N; ++i)
	{
		sum += data[indices[i]];
	}

	double elapsed = timer.elapsed();
	const double bytes_per_sec = 4 * N / elapsed;
	const double cycles_per_read = (elapsed * cycles_per_sec) / N;
	conPrint("predictable indices elapsed:   " + doubleToStringNSigFigs(elapsed, 4) + " s (" + doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " + doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
	printVar(sum);
}

// Test indirection with unpredictable indices
{
	std::vector<uint32> indices(N);
	for(int i=0; i<N; ++i)
		indices[i] = i;

	// Permute indices
	for(int t=N-1; t>=0; --t)
	{
		int k = (int)(rng.unitRandom() * t);
		mySwap(indices[t], indices[k]);
	}

	Timer timer;
	uint64 sum = 0;
	for(int i=0; i<N; ++i)
	{
		sum += data[indices[i]];
	}

	double elapsed = timer.elapsed();
	const double bytes_per_sec = 4 * N / elapsed;
	const double cycles_per_read = (elapsed * cycles_per_sec) / N;
	conPrint("unpredictable indices elapsed: " + doubleToStringNSigFigs(elapsed, 4) + " s (" + doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " + doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
	printVar(sum);
}

A proposal for C++: bit_cast.

Posted 5 Oct 2015

C++ should have a bit_cast function.

bit_cast<T>(x) would get the bits of x, and reinterpret them as the bits of a value of type T.

For example, you would be able to write

float x = 1.0f;
const uint32_t y = bit_cast<uint32_t>(x);

And then y would have the value 0x3f800000.

From a generated-code point of view this would be a no-op. (Or maybe some instructions would have to be emitted for moving to/from floating point registers in this case?)

Edit: A bit_cast is only valid if the source and destination types are the same size. Using bit_cast on types differing in size would give a compile-time error.

This kind of bit conversion needs to be done pretty often if you are writing low-level, high peformance code, or networking code.

There are a number of ways to do something similar with C++ today, but they all have problems, either with correctness or elegance.

Current option 1: pointer casting

Possibly the most old-school way is to pointer cast, like this:

float x = 1.0f;
const uint32_t y = *(uint32_t*)&x;

The problem here is that it is undefined behaviour to write through one pointer type (float*) and read through another (uint32_t*). (due to aliasing rules). Compilers may whinge about it and/or generate code that does not do what you want. For example, GCC says

"warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]"

(see code example here: https://goo.gl/wZcV2D).

Current option 2: union abusing

The trick here is to write into one of the union fields, then read from the other one.

float x = 1.0f;

union fi_union
{
	float f;
	uint32_t i;
};

fi_union u;
u.f = x;
const uint32_t y = u.i;

This *may* work in practice, but is syntactically ugly, is an abuse of the idea of unions, and may be undefined behaviour.

Current option 3: memcpy

The idea here is to just memcpy() from the src to the dest, and rely on the compiler to recognise what you are doing, and to not actually emit the memcpy function call. Example:

template <class Src, class Dest> 
inline Dest bit_cast(const Src& x)
{
	static_assert(sizeof(Src) == sizeof(Dest), "sizeof(Src) == sizeof(Dest)");
	Dest d;
	std::memcpy(&d, &x, sizeof(Dest));
	return d;
}

This works well in practice, but is a little dangerous in the sense that you are relying on a compiler optimisation to avoid very slow code (e.g. it would be very slow if a function call was actually emitted).

Current (non-)option 4: static_cast

static_cast does not do what we want.

float x = 1.0f;
const uint32_t y = static_cast<uint32_t>(x);

puts the value '1' in y, not 0x3f800000 as we want.

Current (non-)option 5: reinterpret_cast

To be honest I'm not sure what reinterpret_cast is for. I never use it. It also doesn't do what we want:

float x = 1.0f;
const uint32_t y = reinterpret_cast<uint32_t>(x);

gives, in Visual Studio 2012: "error C2440: 'reinterpret_cast' : cannot convert from 'float' to 'uint32_t' Conversion is a valid standard conversion, which can be performed implicitly or by use of static_cast, C-style cast or function-style cast"

Semantics

(Added in an edit)

The bit_cast would give the same result as the bit_cast function above:

template <class Src, class Dest> 
inline Dest bit_cast(const Src& x)
{
	static_assert(sizeof(Src) == sizeof(Dest), "sizeof(Src) == sizeof(Dest)");
	Dest d;
	std::memcpy(&d, &x, sizeof(Dest));
	return d;
}

Because we know 'd' (the destination) is in a different memory location from 'x' (the source), there shouldn't be any aliasing problems.

Conclusion

So in conclusion, the alternatives are basically hacky workarounds to something missing in the original language. Older alternatives are increasingly rendered invalid due to fussier optimising compilers. I think the problem should be solved once and for all with a built-in language feature - bit_cast.

Edit: reddit discussion here.

Edit: posted to 'ISO C++ Standard - Future Proposals' list here.

A little Winter demo

Posted 11 Aug 2015

The work-in-progress inlining pass in Winter, combined with the closure and first-class function support described earlier, allows this program:

def compose(function<float, float> f, function<float, float> g) : \(float x) : f(g(x))

def addOne(float x) : x + 1.0

def mulByTwo(float x) : x * 2.0 

def main(float x) float :  
	let 
		z = compose(addOne, mulByTwo)
	in
		z(x)

To compile to this:

.LCPI0_0:
	.long	1065353216              # float 1
	.text
	.globl	main_float_
	.align	16, 0x90
	.type	main_float_,@function
main_float_:                            # @main_float_
# BB#0:                                 # %decr_function<float, float>.exit40
	vaddss	%xmm0, %xmm0, %xmm0
	movabsq	$.LCPI0_0, %rax
	vaddss	(%rax), %xmm0, %xmm0
	retq

The compose function isn't as pretty as in Haskell, but it works, and as you can see it can be inlined to result in fast code.

Closure support in Winter

Posted 6 Aug 2015

Winter now has proper runtime support for first class functions, and anonymous functions (lambda expressions) with lexical closures.

A simple example:

def main() float :	
	let 
		f = \(float x) : x*x
	in
		f(2.0)

returns 4.0.

In this example an anonymous function (that computes the square of its argument) is created and assigned to f. It is then called with the argument 2.0.

Since functions are 'first class', they can be returned from functions:

def makeFunc(float x) function<float> : \() : x
def main() float :  
	let 
		f = makeFunc(2.0) 
	in   
		f()

In this example makeFunc returned a function.

To support lexical scoping, variables are captured:

def makeFunc() function<int, int> :	
	let		
		a = [10, 11, 12, 13]va		
	in						
		\(int x) int : a[x]	
				
def main(int x) int :       
	let	
		f = makeFunc()	
	in	
		f(x)

In the example above the anonymous function returned from makeFunc() captures the variable-length array a, which is indexed into when the function is actually called with f(x).

Closures are allocated on the heap by default. If they can be proven not to escape the current function (which is currently computed with a simple type-based escape analysis), they are allocated on the stack.

The runtime layout for a closure looks like this currently (in LLVM IR):

%anon_func_29_float__closure = type 
{ 
	i64, 
	i64, 
	float (float, %base_captured_var_struct*)*, 
	void (%base_captured_var_struct*)*, 
	%anon_func_29_float__captured_var_struct
}

It has 5 fields: a refcount, some flags (stack or heap allocated), the function pointer, a destructor pointer for cleaning up the captured vars, and a structure containing the captured vars.

All this abstraction doesn't have to come at the cost of runtime inefficiency. Take this program:

def main(array<float, 4> a, array<float, 4> b) array<float, 4> : map(\(float x) : x*x, a)

It generates the optimised assembly:

	vmovups	(%rdx), %xmm0
	vmulps	%xmm0, %xmm0, %xmm0
	vmovups	%xmm0, (%rcx)
	movq	%rcx, %rax
	retq

Unicode string handling in Winter

Posted 6 Jul 2015

There is a fantastic article by Matt Giuca about mistakes programming language designers make and made with regards to Unicode support (or lack thereof): The importance of language-level abstract Unicode strings.

He notes that only Haskell gets it right. (Maybe Rust got it right since then?)

Winter gets it right as well - strings are treated as sequences of Unicode chars, not sequences of bytes. The encoding (UTF-8) is a hidden implementation detail.

Some example Winter code snippets:

codePoint(elem("a\u{393}c", 0)))   #  returns (int)'a'.
codePoint(elem("a\u{393}c", 1)))   #  returns 0x393.
codePoint(elem("a\u{393}c", 2)))   #  returns (int)'c'.

Note that \u{393} in the above is a Unicode escape sequence specifying the Unicode character U+393.

Edit: Reddit discussion.

Don't use LLVM's BBVectorize pass

Posted 28 Jun 2015

Don't use it, it's buggy. I ran into a miscompilation as described here: llvm.org/bugs/show_bug.cgi?id=23845 and it is going to be removed soon anyway: groups.google.com/forum/#!topic/llvm-dev/en-Rbu42K9w

Auto-vectorisation and parallelisation in Winter

Posted 24 May 2015

I have reached an interesting milestone in our functional programming language Winter - Winter now can do auto-parallelisation, auto-vectorisation, and even both at the same time :)

Auto-parallelisation means that code is automatically executed in parallel in multiple threads. Auto-vectorisation means that vector (SIMD) instructions are automatically used instead of scalar instructions. This means that e.g. 4 or 8 multiplications can be done with one instruction.

Taking some Winter code, which applies a function to an input array, generating a result array:

def square(float x) float : x*x	
def main(array<float, 268435456> a, array<float, 268435456> b) array<float, 268435456> : 
    map(square, a)

We're using a large array size here so we get a decent, measurable amount of work to do.

The assembly from the inner loop looks like this:

.LBB2_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovaps	(%rdx,%rax), %ymm0
	vmovaps	32(%rdx,%rax), %ymm1
	vmovaps	64(%rdx,%rax), %ymm2
	vmovaps	96(%rdx,%rax), %ymm3
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vmulps	%ymm2, %ymm2, %ymm2
	vmulps	%ymm3, %ymm3, %ymm3
	vmovaps	%ymm0, (%rcx,%rax)
	vmovaps	%ymm1, 32(%rcx,%rax)
	vmovaps	%ymm2, 64(%rcx,%rax)
	vmovaps	%ymm3, 96(%rcx,%rax)
	subq	$-128, %rax
	addq	$-32, %r10
	jne	.LBB2_3

Very nice code here, LLVM is doing a good job. You can see there are 4 256-bit memory loads being done in parallel, then 4 256-bit 8-element multiplications being done in parallel, then 4 stores being done.

This code runs at memory speed, which is about 5.7 GB/s in this case on my machine. Even a much sloppier inner loop will still run at memory speed when there is this little work to do per element.

Resulting perf measurements are:

JITed code elapsed: 0.187149 s
JITed bandwidth: 5.73737 GiB/s
C++ ref elapsed: 0.196158 s
C++ ref bandwidth: 5.47385 GiB/s

Both the JITed Winter code and the C++ code are memory limited so run at similar speeds.

Taking another example, where more work is done per array element:

def f(float x) float : pow(x, 2.2)	
def main(array<float, 268435456> a, array<float, 268435456> b) array<float, 268435456> : 
    map(f, a)

pow() is a pretty slow function (100 cycles or more). This allows us to see the influence of auto-parallelisation.

Winter auto-parallelises this map function, dividing the array into slices, and passing each slice to a worker thread, on which the actual work is done.

With auto-parallelisation disabled, results are as follows:

pow(x, 2.2), single threaded map():
JITed code elapsed: 6.32286 s
JITed bandwidth: 0.169819 GiB/s
C++ ref elapsed: 1.90434 s
C++ ref bandwidth: 0.563838 GiB/s

The reason the C++ version is faster, is because Visual C++ vectorises the pow() function, whereas LLVM doesn't (I just filed a bug report on this here.)

With auto-parallelisation enabled:

pow(x, 2.2), multi threaded map()
JITed code elapsed: 1.35347 s
JITed bandwidth: 0.793327 GiB/s
C++ ref elapsed: 1.94178 s
C++ ref bandwidth: 0.552969 GiB/s

You can see from the highlighted result that the program runs about 4.67x faster, which is roughly what you would expect using 8 threads on 8 logical cores, considering that 4 of them are hyperthreading cores.

No changes in the winter code were required to use this parallelisation, it's completely automatic.

So there you have it - a proof-of-concept of auto-vectorisation and auto-parallelisation in Winter. The code is not robust yet - I just got it working this weekend. But I think it's quite a promising result. It also shows, I think, that functional programming languages can be just as fast (or faster) than optimised, multithreaded C++ code.

Edit: Discussion on reddit.

Fuzz Testing

Posted 30 Apr 2015

I have spent the last week or so fixing bugs in Winter (our programming language) found by our new fuzz testing code.

Fuzz testing is the process of sending lots of random input to a program, to try and uncover crashes or other undesirable behaviour from the program.

If you are writing a compiler, I highly recommend fuzz testing it, it seems like a very effective way of finding bugs.

Some more details about our fuzzing code:

It's an in-process fuzzer - random programs are generated, compiled, and run if compilation succeeded (e.g. if the program is valid), all in the same process. Because it's in-process, as soon as a crash occurs, the fuzzing run is over. Each random program string is saved to disk before it is run, so to find the program that crashed the process, you just have to get the last line from the file on disk.

It's multithreaded, because you need all the power you can get to find any bugs as fast as possible.

The fuzzer runs through about 20,000 random programs per second per core, so running on 4 cores gives around 80,000 random programs per second. This relatively fast speed is possible because invalid programs are rejected early before computationally intensive stuff like LLVM optimisation and codegen starts.

Valid programs only run at about 300 programs per second per core, but most random programs are invalid, so the average speed comes out to 20K/core/s

The random program generation strategy is something like:

Randomly pick a program from the unit test suite programs (several hundred covering all language features)

Then repeatedly mutate program randomly a few times by picking from among a few options:

1) Insert a random character at a random position in the program.
2) Delete a random chunk of characters from the program
3) Copy a chunk of the program to another random position in the program.
4) Read a section of a random other program from the test suite and insert into the program.
5) Insert a random language keyword or syntactical elements (square bracket pair etc..)

As mentioned I found this fuzzing code to be extremely effective in finding bugs. I found 40-50 crash bugs that took something like a week to fix, and triggered improvements and refactorings to avoid crashes. Our programming language Winter is a lot more robust now.

if you are wondering how there can be so many bugs in a language compiler/runtime, bear in mind that Winter is quite complex, with generic functions, operator overloading, function overloading, type coercion, JITing, higher order functions etc..

I also tried out American Fuzzy Lop (AFL). I ran it for 24 hours or so, in which time it got through 20% or so of the first pass. It didn't find any crashes. I'm not sure if this is because it's not very good at finding crashes, or if I removed all the crash bugs :) It's definitely a lot slower than our in-process multithreaded fuzzing though.

Note that in theory AFL uses a more powerful kind of fuzzing - coverage-guided fuzzing, which can be considered a white-box approach, since it uses knowledge of what code paths are taken for each input program.

Therefore I would highly recommend integrating an in-process fuzzer into your test suite, and after that maybe check out AFL or similar.

(If you want to know more about Winter there are some presentation slides in PDF format from a talk I gave a while ago here: Indigo Shader Language and Winter)

Finding the fastest Turing Machine that computes a given function is uncomputable. (And the consequences for superoptimisation)

Posted 14 Mar 2015

Assume, for a contradiction, that we have some function optimise(f), that returns the fastest Turing Machine that computes the (partial) function f.

The exact definition of 'fastest' doesn't matter.

Now consider the function $ \operatorname{returnOneOnHalt_f}(x) $, that simulates the execution of function f on input x. If f halts, then $ \operatorname{returnOneOnHalt_f} $ returns 1 and halts. If the function f doesn't halt then $ \operatorname{returnOneOnHalt_f} $ doesn't halt.

Now consider what happens when we run optimise() with $ \operatorname{returnOneOnHalt_f} $ as input, e.g. $ \operatorname{optimise}( \operatorname{returnOneOnHalt_f} $).

For a given function f, if f halts on all inputs, then $ \operatorname{returnOneOnHalt_f}(x) = 1 $. Therefore the fastest Turing machine that computes this function is the one that just writes 1 to the tape and then halts.

On the other hand, if f does not halt on all inputs, then there is at least one input value for which the optimsed $ \operatorname{returnOneOnHalt_f}$ does not halt. This Turing machine is more complex than the machine that just writes one and halts. For example it will execute more than one step as it does not halt.

Therefore we have created an effective procedure for determining if f halts on all inputs - just look at the optimised $ \operatorname{returnOneOnHalt_f} $ and see if it is the 'write one to tape and halt' machine.

However we know that such an effective procedure is impossible due to the Halting Theorem. Therefore we have a contradiction, and we cannot have such a function optimise().

Pseudocode for such a procedure looks like this:


function makeReturnsOneOnHalt(f)
	return function that simulates f, and when f halts, returns 1.
end

function doesHaltOnAllInputs(f)
	returnsOneOnHalt_f = makeReturnsOneOnHalt(f)
	
	optimised_returnsOneOnHalt_f = optimise(returnsOneOnHalt_f)
	
	return optimised_returnsOneOnHalt_f == returnsOne
end

Consequences for superoptimisation

Superoptimisation is a compiler optimisation technique that is supposed to find the shortest, or fastest program to compute a function. In other words, it doesn't just make a program somewhat faster, but actually finds the optimal program.

The term was introduced in the paper 'Superoptimizer -- A Look at the Smallest Program' by Henry Massalin in 1987. Interestingly the paper defines superoptimisation to give the shortest program, as opposed to the fastest.

If the programs being considered are 'Turing-complete' (or 'Turing-machine equivalent'), then either kind of superoptimisation (optimising for program length of for program speed) become impossible (uncomputable).

Finding the shortest program is uncomputable as a consequence of Kolmogorov Complexity theory.

As the proof above demonstrates, finding the fastest program that computes a given function is also uncomputable. This means that superoptimisation must necessarily be restricted to subsets of all possible functions, such as the total functions.

A limitation of LLVM with regard to marking sret functions as pure/readonly

Posted 14 Mar 2015

I found an interesting limitation of LLVM recently: llvm.org/bugs/show_bug.cgi?id=22853

Reassociation optimisations in Visual Studio 2012

Posted 30 Dec 2014

Multiplication in C++ is left to right associative - e.g.

y = a * b * c

means

y = (a * b) * c

Note that although the mathematical operation of multiplication is associative, e.g. (a * b) * c = a * (b * c), the same is not true for floating point multiplication, due to intermediate rounding error.

Therefore in default circumstances, the compiler may not optimise

y = a * b * c

y = a * (b * c)

as it may give different results. However, if 'fast math' mode is enabled, compilers may in theory perform this 'reassociation'.

This can be important for a number of reasons - one is that b and c may be constants, and the y = a * (b * c) may be inside a loop. Performing reassociation means that the b*c multiplication may be 'hoisted' outside the loop, so only one multiplication is done instead of two.

Clang performs this reassociation optimisation:


float f()
{
  	const float b = 2.f;
	const float c = 3.f;
	float sum = 1.f;
	for(int i=0; i<1024; ++i)
	{
		const float a = (float)(sum + 1.f);
	
		const float y = a * b * c;
		sum += y;
	}
  
  return sum;
}

gives:

.LCPI0_0:
	.long	1065353216              # float 1
.LCPI0_1:
	.long	1086324736              # float 6
f():                                  # @f()
	movss	.LCPI0_0(%rip), %xmm1
	movl	$1024, %eax             # imm = 0x400
	movss	.LCPI0_1(%rip), %xmm2
	movaps	%xmm1, %xmm0
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
	movaps	%xmm0, %xmm3
	addss	%xmm1, %xmm0
	mulss	%xmm2, %xmm0
	addss	%xmm3, %xmm0
	decl	%eax
	jne	.LBB0_1
	ret

(See the program on http://gcc.godbolt.org/)

Note that there is only one multiplication (the mulss instruction) inside the loop. Also b * c is precomputed. (note the 'float 6' constant).

Visual Studio 2012, however, gives:

static float func()
{
00000001408750A0  sub         rsp,18h  
  	const float b = 2.f;
	const float c = 3.f;
	float sum = 1.f;
00000001408750A4  movss       xmm4,dword ptr [__real@3f800000 (0140D9C67Ch)]  
00000001408750AC  movss       xmm5,dword ptr [__real@40000000 (0140D9CBF8h)]  
00000001408750B4  movaps      xmmword ptr [rsp],xmm6  
00000001408750B8  mov         eax,80h  
00000001408750BD  movaps      xmm3,xmm4  
00000001408750C0  movss       xmm6,dword ptr [__real@40400000 (0140D9CD58h)]  
00000001408750C8  nop         dword ptr [rax+rax]  
	for(int i=0; i<1024; ++i)
	{
		const float a = (float)(sum + 1.f);
00000001408750D0  movaps      xmm0,xmm3  
00000001408750D3  addss       xmm0,xmm4  
	
		const float y = a * b * c;
00000001408750D7  mulss       xmm0,xmm5  
00000001408750DB  mulss       xmm0,xmm6  
		sum += y;
00000001408750DF  addss       xmm3,xmm0  
00000001408750E3  movaps      xmm1,xmm3  
00000001408750E6  addss       xmm1,xmm4  
00000001408750EA  mulss       xmm1,xmm5  
		sum += y;
00000001408750EE  mulss       xmm1,xmm6  
00000001408750F2  addss       xmm3,xmm1  
00000001408750F6  movaps      xmm0,xmm3  
00000001408750F9  addss       xmm0,xmm4  
00000001408750FD  mulss       xmm0,xmm5  
0000000140875101  mulss       xmm0,xmm6  
0000000140875105  addss       xmm3,xmm0  
0000000140875109  movaps      xmm1,xmm3  
000000014087510C  addss       xmm1,xmm4  
0000000140875110  mulss       xmm1,xmm5  
0000000140875114  mulss       xmm1,xmm6  
0000000140875118  addss       xmm3,xmm1  
000000014087511C  movaps      xmm0,xmm3  
000000014087511F  addss       xmm0,xmm4  
0000000140875123  mulss       xmm0,xmm5  
0000000140875127  mulss       xmm0,xmm6  
000000014087512B  addss       xmm3,xmm0  
000000014087512F  movaps      xmm1,xmm3  
0000000140875132  addss       xmm1,xmm4  
0000000140875136  mulss       xmm1,xmm5  
000000014087513A  mulss       xmm1,xmm6  
000000014087513E  addss       xmm3,xmm1  
0000000140875142  movaps      xmm2,xmm3  
0000000140875145  addss       xmm2,xmm4  
0000000140875149  mulss       xmm2,xmm5  
000000014087514D  mulss       xmm2,xmm6  
0000000140875151  addss       xmm3,xmm2  
0000000140875155  movaps      xmm1,xmm3  
0000000140875158  addss       xmm1,xmm4  
000000014087515C  mulss       xmm1,xmm5  
0000000140875160  mulss       xmm1,xmm6  
0000000140875164  addss       xmm3,xmm1  
0000000140875168  dec         rax  
000000014087516B  jne         func+30h (01408750D0h)  
	}
  
	return sum;
0000000140875171  movaps      xmm0,xmm3  
}
0000000140875174  movaps      xmm6,xmmword ptr [rsp]  
0000000140875178  add         rsp,18h  
000000014087517C  ret

Although VS has unrolled the loop a bit, you can see that there are two multiplications done, with a dependency between them, e.g. the first has to finish before the second is done. So VS doesn't seem to do the reassociation optimisation, even with fast maths enabled (/fp:fast), which is unfortunate.

EDIT: Bug posted here: https://connect.microsoft.com/VisualStudio/feedback/details/1075102.

Also optimisation still not done in VS 2013.

Some wave simulation animations

Posted 30 Sep 2014

RDTSC doesn't actually measure the number of clock cycles elapsed

Posted 3 Jul 2014

The RDTSC instruction on x86 CPUs is supposed to read a counter that is incremented (by one) every clock cycle.

However, it doesn't actually do this. Instead, the counter increases at a fixed frequency, independent of the actual clock speed of your CPU, which varies based on load, temperature etc..

This does not seem to be documented anywhere. I found this out the hard way, after measuring the frequency of rdtsc myself, and noticing it differs from that given by CPU-z. The CPU-z value seems greatly more likely to be the correct value, since it correlates with temperature values of the CPU (reported with speedfan) and actual amount of work done by the CPU (speed of execution of a program).

On my i7 920, which has a 'on the box' clock speed of 2.67 GHz, RDTSC increments at ~2.664 GHZ. The actualclock speed varies from 1.5 GHz at idle, to ~2.8GHz under load (measured with CPU-z).

There's a great summary of why RDTSC does this in this stack overflow answer http://stackoverflow.com/a/19942784 by 'Brendan'.

In summary: some idiot at some point got confused by changing clock speeds, and complained as it broke their dodgy program, so AMD and Intel changed the semantics of RDTSC to run at a fixed freq.

Meanwhile I found out completely by accident after several years of using RDTSC.

To me this is good example of the terrible situation modern computer and software 'engineering' is in. Changes of semantics like this are not documented properly, or even at all, and you only find out years later by accident, and can only confirm it and find an explanation in some random stack overflow post.

The invariance theorem

Some people brought up the invariance theorem in defense of some global notion of Kolmogorov complexity. While the theorem may be true, what it says is basically useless in trying to define a global Kolmogorov complexity - i.e. a complexity that is independent of the language chosen.

Let's walk through an example of how the invariance theorem fails to do anything helpful. I'll use the notation from my previous post.

First a definition of the invariance theorem (from Wikipedia):

If $ K_1 $ and $ K_2 $ are the complexity functions relative to Turing complete description languages $L_1$ and $L_2$, then there is a constant $c$ – which depends only on the languages $L_1$ and $L_2$ chosen – such that $$ \forall s. -c \le K_1(s) - K_2(s) \le c $$

The proof follows from considering an interpreter for $L_1$ written in $L_2$ and vice-versa.

Ok, so that's the invariance theorem. Now lets work through an example of why it is useless. Consider some arbitrary string $x$, for example: $$ x=ababaaabbabaabaababbbabaababbbaaababbbba $$ Let us suppose the complexity wrt. language $L_1$ is 20, e.g. $$ K_1(x) = 20 $$ Now consider the language $L_{silly, x}$, defined as in the previous post. This is the language specifically chosen to make the complexity of x zero in the language. From the definition of $L_{silly, x}$, we have $$ K_{silly, x}(x) = 0 $$ So, what does the invariance theorem say about the two languages we have chosen? ($L_1$ and $L_{silly, x}$) The invariance theorem says that there is some constant $c$ such that $$ \forall s. -c \le K_1(s) - K_{silly, x}(s) \le c $$ or using our values $K_1(x) = 20$ and $K_{silly, x}(x) = 0$: (since the statement holds for all strings) $$ -c \le 20 - 0 \le c $$ so $$ -c \le 20 \le c $$ so $$ c \ge 20 $$ In other words, for this particular string $x$, there is some language $L_{silly, x}$ such that the complexity of x wrt. the language is zero, and the bound ($c$) for this pair of languages is large enough to cover the gap in complexities between the two languages. So the invariance theorem has not stopped the complexity of $x$ going to zero in some other language, even though it is non-zero in $L_1$.

An alternative definition of Kolmogorov complexity

There are alternative definitions of Kolmogorov complexity, for example the one by given by Peter Bro Miltersen in his course notes.

These alternative definitions also suffer from the same problem - the complexity depends on the language chosen, and there is no reasonable 'invariant' complexity.

Recapping his definitions, we have programming systems (languages) $L$, and programs in these languages: $p \in L$. These are given semantics in the following way: Each language $L$ has two associated algorithms $\phi$, $\tau$ so that $\phi(p)$ for $p \in L $ outputs the index of a Turing machine $M_{\phi(p)}$. So $p$ is given the behavior of the associated Turing machine: $$ p(x) := M_{\phi(p)}(x) $$ A Kolmogorov decoder relative to L is then defined: $$ d_K(pz) := p(z) $$ This function works by splitting the string $pz$ into two components - the description of the program $p$, and the input to the program, $z$. For this to work, no program string $p$ can be the prefix of another one.

The Kolmogorov code is then defined: $$ c_K(x) := \text{the shortest y so that } d_K(y) = x $$ Note that in in this definition $y$ is not a program, but the definition of a program along with input to the program, concatenated together.

Finally the Kolmogorov complexity is given as $$ K(x) := |c_K(x)| $$ e.g. the length of the Kolmogorov code.

Using these definitions, we can make the following theorem:

Theorem: For any finite string $x$, there is some language $L_{silly, x}$ such that the Kolmogorov complexity of $x$ in $L_{silly, x}$ is 1.

Proof: For the given string $x$, there is some Turing machine that outputs $x$ then halts, when given the empty string input $\varepsilon$. Suppose this Turing machine has index i. Define the associated semantics algorithm $\phi$ for the language such that $$ \phi(p_1) := i $$ Where $$ p_1 = ``1" $$ E.g. $p_1$ is the string $``1"$. Which means that $$ p_1(\varepsilon) = M_{\phi(p_1)}(\varepsilon) = M_i(\varepsilon) = x $$ So $$ d_K(p_1\varepsilon) = p_1(\varepsilon) = x $$ Given that $$ c_K(x) = \text{the shortest y so that} d_K(y) = x $$ we have $$ c_K(x) = p_1\varepsilon$$ $$ c_K(x) = ``1" $$ so $$ K(x) := |c_K(x)| = 1 $$ The reason we have a complexity of one instead of zero is due to the prefix-free property required when combining the two strings for $d_K$.

Decidability of finding the fastest Turing machine to compute a function with a finite domain

Posted 8 Feb 2014

Let's say we have some (total) function $f$, defined on some finite domain $A$: $$ f : A \mapsto B $$ For example, f might be a square function defined on $ \{1, 2, 3, 4\} $:

Input    Output
1        1
2        4
3        9
4        16

The question now arises, is it possible to define an effective procedure that will determine the fastest Turing machine that computes the function $f$.

Without any limitations on the Turing machine, we can simply construct a Turing machine that computes f in one step - you basically encode f into the transition function of the Turing machine.

Theorem: There is an effective procedure for finding the fastest Turing machine that computes a function f with finite domain. The fastest Turing machine takes just 1 step.

Proof: To do this we construct a machine $T_1$ that has a different tape symbol for each input value to the function - e.g. element in the domain $ A $. The transition function of $T_1$ then simply maps from the read element $a$ to the output element $ f(a) $. The machine $T_1$ writes $ f(a) $ to the tape, overwriting $ a $, then halts. Q.E.D.

I don't find this a very interesting proof, since it doesn't really follow the ethos of Turing machines - which it violates by throwing more table symbols at the problem until it can do it in one step. A more interesting problem is the case where the tape alphabet $ \Gamma $ is fixed - e.g. to the binary case, where it consists of just the set $ \{0, 1, b\} $ - zero, one, and blank. This corresponds a lot better to 'real' computers. So let's fix the tape alphabet and ask the question again:

Given a fixed tape alphabet $ \Gamma $, is there an effective procedure for finding the fastest Turing machine that computes the given function $f$, where $f$ has a finite domain.

Let's also say that we have some 'reference' Turing machine $T_{ref}$ that computes this function. Such a reference Turing machine can always be constructed by simply handling each possible input value as a special case and returning the corresponding output value. In normal programming pseudocode it could look something like

if input == 1
  return 1
else if input == 2
  return 4
else if input == 3
  return 9
else if input == 4
  return 16

First we need to define what it means to be fastest. I'll define the speed of a Turing machine that computes such a function (finite domain, total) to be the greatest number of steps that the machine takes over all of its inputs - so it's a worst case measure.

Given that definition of speed, the fastest turing machine that computes the function f will be the machine that takes the least number of steps for its worst case input - e.g. the input that it takes the largest number of steps on.

Theorem: Given a fixed tape alphabet $ \Gamma $, There is an effective procedure for finding the fastest Turing machine that computes the given function $f$, where $f$ has a finite domain. (e.g. the problem is decidable)

Proof: To prove this, we will first show that there is a finite number of effectively-different Turing machines that need to be considered. Then the fastest can be found by testing each Turing machine in turn, and selecting the fastest.

Suppose the worst case number of steps for the reference Turing machine $ T_{ref} $ is $N_{ref}$.

Lemma: The maximum number of states that may be visited after N steps is $$ |\Gamma| ^ N $$ where $ |\Gamma| $ is the number of different tape symbols.

Proof: After zero steps the machine is in the initial state. From the initial state, it may transition to a different state based on each possible tape symbol, so after 1 step it may be in $ |\Gamma| $ states. Likewise, after two steps it may be in one of $ |\Gamma|^2 $ states. So after N steps it may be in one of up to $ |\Gamma|^N $ states. Q.E.D.

Consider any Turing machine faster than $ T_{ref} $. Since its worst case number of steps will be less than $N_{ref}$, when run on any input in A, the number of possible states it may be in is less than or equal to $ |\Gamma|^ {N_{ref}} $.

Therefore we need only consider Turing machines with number of states less than or equal to $ M = |\Gamma|^ {N_{ref}} $.

Any machine with more states than $M$ will be effectively the same as one of the machines with number of states <= $M$, since the extra states can only be reached after more than $N_{ref}$ steps.

The procedure to find the fastest Turing machine $T_{f}$, then is as follows:

Consider each Turing machine $T_{i}$ with $M$ states, and a given transition function using the $M$ states. The number of such Turing machines is finite, and bounded by a exponential function of $M$ and $|\Gamma|$.

For each element $a$ in the domain $A$, run $T_{i}$ with input $a$, for up to $N_{ref}$ steps. If it halts with incorrect output, reject the machine. If it does not halt after $N_{ref}$ steps, we can reject the machine, as it will not be faster than $T_{ref}$, and therefore cannot be the fastest Turing machine. If it computes $f(a)$ correctly for all elements in $A$, and it has smallest worst-case running time of all Turing machines considered so far, remember it.

Once all potential Turing machines have been considered, the one with the lowest worst-case running time that computes $f$ correctly will be the fastest Turing machine that computes $f$ we are looking for. Q.E.D.

Remarks

There are a couple of interesting things about this proof.

First, we avoid the issue you usually get when enumerating over Turing machines run on some input - that the machine may not halt. We sidestep this problem as we have an upper bound on the number of steps to run the machine for.

Secondly, a lot of functions we deal with in computer programs are defined on finite domains, especially functions that take as input finite precision integers or floating point numbers. For example, the number of single precision floating point values is of course finite. So in theory, we can write a program to compute, for example, the fastest sqrt(float) function.

There's a humorous theorem called the 'Full employment theorem', which says that writing a compiler that outputs the smallest program for some function is impossible, as the problem is uncomputable. However, this article has shown that finding the fastest program is computable, at least for a finite domain.

A related question is if finding the fastest program for infinitely sized domains, e.g. the integers, is computable. I'm not sure about this one, let me know if you have a proof either way :)

Edit, 7th May 2014: Reddit user wildeye pointed me at Manual Blum's work: A Machine-Independent Theory of the Complexity of Recursive Functions (PDF) which has some things to say about bounds on more general functions.

Problems with JITing and LLVM #1 - large offsets

Posted 15 Jan 2014

Looking through my LLVM Jitted code, I noticed some calls to cos that looked like this:

movabsq	$cosf, %rax
movaps	%xmm6, %xmm0
callq	*%rax

Whereas as some usual non-jitted code calling cos looks like this:

call        cosf (0140C1E9F4h)

It's unfortunate that the jitted code requires two instructions to do what the normal code can do in one, and that the call is now indirect.

After asking on the LLVM IRC channel, I think I understand why.

The reason for this is that LLVM conservatively (and correctly in my tests) assumes that cosf is > 2^32 bytes away from the location of the call instruction, and so cannot be jumped to directly with the call instruction, as it uses 32 bit relative offsets.

See http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models/ for more information.

This could be improved in LLVM in the following way (I'm sure there are other possibilities as well)

Try and allocate code for the Jitted code to go around the rest of the executable code. This seems to be around 0140C1E9F4h in my tests, which is about 5GB through the address space.

If the jitted code is succesfully allocated near e.g. the cos call, then emit a direct call instruction.

Allocating memory at a specific address from the kernel seems to be possible with mmap (and the Windows equivalent) although I haven't tried it.

Edit, 7th May 2014: On Windows, VirtualAlloc takes as its first parameter the address at which to allocate memory, which indicates that allocating at certain addresses should be possible.

Compression

Posted 27 Oct 2013

For any given finite string, there is a compression method (an invertible (one-to-one) function) for which the finite string is compressed to the empty string.

Proof:

Given a finite string i, we can construct the one-to-one function $ C_i $: $$ C_i(x) = \begin{cases} \epsilon &\mbox{if } x = i \\ i &\mbox{if } x = \epsilon \\ x &\mbox{otherwise} \end{cases} $$ which means that $$C_i(i) = \epsilon$$ i.e. the input string $ i $ is mapped to the empty string $ \epsilon $, and $$C_i(\epsilon) = i$$ i.e. the empty string $ \epsilon $ is mapped to $ i $, and for all other strings $$C_i(x) = x$$ i.e. we have the identity mapping.

The Conservative Halting Problem

Posted 26 Oct 2013

Turing proved with his halting theorem (There's a nice simple proof here) that there is not a correct program $$Halt(p, i)$$ that returns true if program p halts on input i, and returns false if the program does not halt (i.e. if it loops forever). (This is for Turing-complete languages, i.e. languages with programs that may loop forever). However, what does exist, is a correct program HaltM (short for Halt Maybe), that returns one of 3 options: Halts, Unknown, or DoesNotHalt. To show that such a program does exist, we can do a proof-by-construction. Here's the first example of such a program:

HaltM_1(p, i): 
   return Unknown

This program just always returns Unknown regardless of the input. Admittedly that's pretty lame :) To make a slightly more interesting HaltM program, first let's introduce two programs, one which halts on all inputs, one which loops forever on all inputs.

HaltsOnAll(x):  
   return 1

LoopForever(x):  
   Loop Forever

Now we can define a slightly more interesting HaltM program, that doesn't always return Unknown:

HaltM_2(p, i): 
   if p == HaltsOnAll then
      return Halts
   else if p == LoopForever then
      return DoesNotHalt
   else
      return Unknown
   end

What was the point of this proof? I get the feeling that some people seem to think that for Turing-complete languages, it's impossible to say anything about if the program halts or not, just by examining it. This simple counterexample shows, that although you can't write a program Halt(p, i) that can decide for sure in all cases, you can write a program HaltM_2(p, i) that can tell if the program halts or does not halt in some cases.

Bézier curves

Posted 10 Aug 2013

Constructed and rendered in Indigo.

Kolmogorov Complexity - it's a bit silly

Posted 25 Jun 2013

Kolmogorov Complexity is kind of a cool idea, at least at first glance.

You may know that the information content of an infinite stream of random symbols (measured in bits per symbol) is given by the entropy as described by Claude Shannon.

The entropy is all well and good, but it would be nice to have a measure for the information content in a finite length string.

Kolmogorov Complexity gives a nice description of the information content of a finite string by how much information it takes to describe the string.

Take the two strings:

$$abababababababababababababababababababab$$

$$ababaaabbabaabaababbbabaababbbaaababbbba$$

The first string can be described as 'ab repeated 20 times'. The second string, since it is more-or-less random, can't be described in a simpler way than just reciting the whole string.

Kolmogorov Complexity says that the first string has a lower complexity, or information content, relative to the second string.

A little more formally, suppose we have some programming language L. (For example, L could be Ruby) L takes as input program strings p, that when run by the language L, produce an output string $$L(p)$$ (if the program halts).

Take a string x. x could be the output of many different programs p. However, there will be one (or more) programs that are shorter than any other programs that produce x. The length of such a p (call it |p|), is the Kolmogorov complexity of x.

The Kolmogorov complexity of a string, for some language L, is thus defined to be $$K_L(x) = |p|$$ where p is the shortest program string such that $$L(p) = x$$

Let's do some examples with L = Ruby.

Take the string x_1 =

$$abababababababababababababababababababab$$

that we saw above. This can be output with the ruby program

puts "ab"*20

This program is 11 bytes long. Let's work in bytes for now, since it's not really possible to have non-byte length programs. So for this program, call it p_1, $$|p_1| = 11$$ There may be a shorter program that can spit out the string x above (Maybe some Ruby experts will know one). However, we do at least have an upper bound on the Kolmogorov complexity of x_1 for the ruby language: $$K_{ruby}(x_1) <= 11$$ Now let's consider the other string, x_2 =

$$ababaaabbabaabaababbbabaababbbaaababbbba$$

The shortest program that will spit this out (that I know of) is

puts"ababaaabbabaabaababbbabaababbbaaababbbba"

Which is 46 bytes long.

So we know $$K_{ruby}(x_2) <= 46$$ If we accept that these upper bounds are reasonably tight (which I think is reasonable), then we can conclude that the first string has a much lower Kolmogorov complexity than the second, as we said earlier.

At this point, you may reasonably ask, 'but what about other languages such as C++ or Perl?' Obviously since C++ is so verbose, you are not going to do better in terseness than Ruby. But what about 'line-noise as syntax' Perl? It's entirely possible that $$K_{perl}(x_1) < K_{ruby}(x_1)$$ In general the Kolmogorov complexity of a particular string is going to vary with the programming language L. Isn't this a problem? Wikipedia will tell you something like "The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the invariance theorem)." The effect of changing languages is bounded because the first language can interpret the second, and vice-versa (assuming the languages are universal). For example, if we had a nice Perl program that spits out a string, we could write a Perl interpreter in Ruby, concatenated with the original Perl program. The length of the Perl program is the constant described above.

The overall Kolmogorov complexity of a string is thus defined as $$K(x) = |p|$$ where p is the shortest program string for language L such that $$L(p) = x$$ and we consider all programming languages.

At this point, people seem to think the problem solved. There's just one little hitch. When dealing with finite length strings, constant factors are important, especially when they can be much longer than the strings in question. Anyway, to show why Kolmogorov complexity is a bit silly, I will define a class of new programming languages $$L_{verysilly,i}$$ such that $$L_{verysilly,i}$$ is

puts i

In other words, all the programming language does is discard any program input and print out the string i. So, let's have another look at the Kolmogorov complexity of $$x_2 = ababaaabbabaabaababbbabaababbbaaababbbba$$ Given that if we feed the empty program string into it we get x_2: $$L_{verysilly,x_2}('') = x_2$$ Then the shortest program that produces x_2 is obviously the empty string '', which has length zero, so $$K_{verysilly,x_2}(x_2) = 0$$ In general, for any (finite) string x, there exists some very silly language $$L_{verysilly,x}$$ such that it spits out x: $$L_{verysilly,x}('') = x$$ and so the Kolmogorov complexity of x is zero in this language. $$K_{verysilly,x}(x) = 0$$ Perhaps you might protest that such very silly languages aren't 'proper' programming languages like Ruby, since they ignore the program string completely, and can only produce one output. And that would be quite reasonable. Let us restrict ourselves to universal languages then (e.g. languages that can simulate any other programming language / Turing machine) We can introduce the slightly less silly language class $$L_{silly,i}$$ Where $$L_{silly,i}$$ is defined as the programming language (implemented in Ruby):

arg = ARGV[0]

if(arg.nil?)
	puts i
else
	eval(arg)
end

What this does, is if the input string is empty (I'm checking for Null here as Ruby or Windows seem to confuse the empty string with the null string), then it prints out the string i and terminates. Otherwise it evaluates/executes the string as a Ruby program.

To take a concrete example, the language $$L_{silly,x_2}$$ is

arg = ARGV[0]

if(arg.nil?)
	puts "ababaaabbabaabaababbbabaababbbaaababbbba"
else
	eval(arg)
end

If you pass the empty string to this language, it prints out the x_2:

PS C:\programming\Kolmogorov> ruby .\interpreter.rb ""
ababaaabbabaabaababbbabaababbbaaababbbba

However it can also evaluate arbitrarily complex programs:

PS C:\programming\Kolmogorov> ruby .\interpreter.rb "puts(\`"ab\`" * 20)"
abababababababababababababababababababab

So even with this universal language, we still have $$K_{silly,x_2}(x_2) = 0$$ So to go over what we have constructed:

Theorem:

The Kolgomorov complexity of all (finite) strings is zero.

Proof:

Take any string i

Consider the language $$L_{silly, i}$$ Where $$L_{silly, i}('') = i$$ (The empty string case) and $$L_{silly, i}(x) = L_{ruby}(x)$$ if x is not the empty string.

Since $$|''| = 0$$ therefore $$K_{silly, i}(i) = 0$$ and therefore $$K(i) = 0$$ q.e.d.

So Kolmogorov complexity as it's currently defined has restricted utility. Maybe if you restrict the language to just a single language then it is useful. But if you allow any language to be used you run into the silly situation described above.

Maybe there's some variation on Kolmogorov complexity out there that avoids the dependence on a particular language, while still capturing the neat idea at the heart of it.

The most annoying thing about Windows

Posted 17 Jun 2013

... from a dev perspective at least. When you allocate an amount larger than the system RAM for some reason, maybe by accident, or maybe because MSVC miscompiles your code. Then Windows proceeds to page everything out to disk, including task manager, which means you can't shut down the offending program. Your computer then becomes completely unresponsive. 5 Minutes later, when you can finally kill the program, then every other program competes in a fight to the death to read in 1 byte of paged memory from disk at a time, which takes forever.

A couple of ideas for the Windows devs:
* Don't page task manager to disk, ever.
* Get a decent IO scheduler so programs don't read like 1 byte between a disk seek.

MathJax

Posted 28 Apr 2013

I'm going to try using MathJax for displaying LaTeX maths in my blog posts. It seems to work pretty well.

The only significant problem is the flash of unrendered latex code for 200ms or so when viewing a post.

$$x = {-b \pm \sqrt{b^2-4ac} \over 2a}$$

Sockets should have a flushHint() API call.

Posted 28 Apr 2013

The (Berkeley) sockets API should really have a function call to tell the implementation to do a local flush if possible, e.g. to stop buffering data and send the current packet.

Let's say you are writing a webserver. Your webserver has received an HTTP request. You have read it in, decided on what page to return in response, and are now piecing the page together out of pieces of data and HTML. After each bit of HTML has been created you write it to the socket with send(). At some point, you write the final "</html>" tag. At this point you have finished writing the response, and you want the computer to send it back to the client ASAP.

However, since there is no flush method in the sockets API, you can't tell it to send immediately. Instead the sockets implementation just dicks around for say 0.2 seconds (or whatever the implementation timeout value is) before it decides to finally send the last packet. This seems quite silly to me. There should be some kind of flushHint() API call that would tell the implementation to send the last packet ASAP.

Workaround

Why does MSVC do this?

Conclusion

Why this happens

Impact

Types of performance bugs

Defining a performance bug

How to find performance bugs

Conclusion

Proposed fix

Results with the proposed fix

Proposed solution

More results

More results on other platforms

The standard story

With absolute space and time

Special relativity and absolute time and space

Some results

Limitations of radix sort

Conclusion

Further reading

Fixed length outputs

One-way functions

Implications

More reading:

Sorting algorithms

Stirling's approximation

Current option 1: pointer casting

Current option 2: union abusing

Current option 3: memcpy

Current (non-)option 4: static_cast

Current (non-)option 5: reinterpret_cast

Semantics

Conclusion

Consequences for superoptimisation

The invariance theorem

An alternative definition of Kolmogorov complexity

Remarks