The weblog of Nicholas Chapman

Auto-vectorisation and parallelisation in Winter

Posted 24 May 2015

I have reached an interesting milestone in our functional programming language Winter: Winter can now do auto-parallelisation, auto-vectorisation, and even both at the same time :)

Auto-parallelisation means that code is automatically executed in parallel on multiple threads. Auto-vectorisation means that vector (SIMD) instructions are automatically used instead of scalar instructions, so that e.g. 4 or 8 multiplications can be done with one instruction.

Take some Winter code which applies a function to an input array, generating a result array:

```
def square(float x) float : x*x

def main(array a, array b) array : map(square, a)
```

We're using a large array size here so we get a decent, measurable amount of work to do. The assembly from the inner loop looks like this:

```asm
.LBB2_3:                        # %vector.body
                                # =>This Inner Loop Header: Depth=1
	vmovaps	(%rdx,%rax), %ymm0
	vmovaps	32(%rdx,%rax), %ymm1
	vmovaps	64(%rdx,%rax), %ymm2
	vmovaps	96(%rdx,%rax), %ymm3
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vmulps	%ymm2, %ymm2, %ymm2
	vmulps	%ymm3, %ymm3, %ymm3
	vmovaps	%ymm0, (%rcx,%rax)
	vmovaps	%ymm1, 32(%rcx,%rax)
	vmovaps	%ymm2, 64(%rcx,%rax)
	vmovaps	%ymm3, 96(%rcx,%rax)
	subq	$-128, %rax
	addq	$-32, %r10
	jne	.LBB2_3
```

This is very nice code; LLVM is doing a good job. You can see there are 4 256-bit memory loads being done in parallel, then 4 256-bit, 8-element multiplications being done in parallel, then 4 stores. This code runs at memory speed, which is about 5.7 GiB/s in this case on my machine. Even a much sloppier inner loop will still run at memory speed when there is this little work to do per element. The resulting perf measurements are:

```
JITed code elapsed:   0.187149 s
JITed bandwidth:      5.73737 GiB/s
C++ ref elapsed:      0.196158 s
C++ ref bandwidth:    5.47385 GiB/s
```

Both the JITed Winter code and the C++ code are memory-limited, so they run at similar speeds.
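For reference, the C++ analogue of the `map(square, a)` call is just an elementwise loop like the sketch below. This is not the post's actual benchmark code (the function name `map_square` and the raw-pointer interface are illustrative); it only shows the kind of dependence-free scalar loop that LLVM's loop vectoriser typically turns into AVX code like the listing above when optimisation is enabled (e.g. `clang -O2 -mavx`):

```cpp
#include <cstddef>

// Scalar reference implementation of map(square, a).
static inline float square(float x) { return x * x; }

void map_square(const float* a, float* out, std::size_t n)
{
    // Each iteration is independent, with no aliasing tricks and a
    // simple stride-1 access pattern: ideal input for auto-vectorisation.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = square(a[i]);
}
```

The unrolling by four and the 32-byte aligned `vmovaps` loads in the assembly above are the vectoriser's own choices; the source loop stays completely scalar.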
Take another example, where more work is done per array element:

```
def f(float x) float : pow(x, 2.2)

def main(array a, array b) array : map(f, a)
```

pow() is a pretty slow function (100 cycles or more), which lets us see the influence of auto-parallelisation. Winter auto-parallelises this map call, dividing the array into slices and passing each slice to a worker thread, on which the actual work is done.

With auto-parallelisation disabled, the results are as follows:

```
pow(x, 2.2), single-threaded map():
JITed code elapsed:   6.32286 s
JITed bandwidth:      0.169819 GiB/s
C++ ref elapsed:      1.90434 s
C++ ref bandwidth:    0.563838 GiB/s
```

The C++ version is faster because Visual C++ vectorises the pow() function, whereas LLVM doesn't (I just filed a bug report on this here).

With auto-parallelisation enabled:

```
pow(x, 2.2), multi-threaded map():
JITed code elapsed:   1.35347 s
JITed bandwidth:      0.793327 GiB/s
C++ ref elapsed:      1.94178 s
C++ ref bandwidth:    0.552969 GiB/s
```

You can see from the highlighted result that the program runs about 4.67x faster (6.32286 s down to 1.35347 s), which is roughly what you would expect using 8 threads on 8 logical cores, considering that 4 of them are hyperthreading cores. No changes to the Winter code were required to use this parallelisation; it's completely automatic.

So there you have it: a proof of concept of auto-vectorisation and auto-parallelisation in Winter. The code is not robust yet (I just got it working this weekend), but I think it's quite a promising result. It also shows, I think, that functional programming languages can be just as fast as (or faster than) optimised, multithreaded C++ code.

Edit: Discussion on reddit.
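Winter's actual parallel runtime isn't shown in the post, but the slice-per-worker-thread scheme it describes can be sketched in C++ as follows. All names here (`parallel_map`, the slice arithmetic, the thread count) are illustrative assumptions, not Winter's implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// The per-element function from the example: pow(x, 2.2).
static float f(float x) { return std::pow(x, 2.2f); }

// Divide [0, n) into contiguous slices and process each slice on its
// own worker thread, mirroring the auto-parallelised map() described above.
void parallel_map(const float* a, float* out, std::size_t n,
                  unsigned num_threads = std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 1;
    const std::size_t slice = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * slice;
        const std::size_t end = std::min(n, begin + slice);
        if (begin >= end) break; // no work left for this thread
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = f(a[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

Because the slices are disjoint, the worker threads need no synchronisation beyond the final join, which is why an expensive per-element function like pow() scales close to the core count.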