I have reached an interesting milestone with our functional programming language Winter - Winter can now do auto-parallelisation, auto-vectorisation, and even both at the same time :)
Auto-parallelisation means that code is automatically executed in parallel on multiple threads. Auto-vectorisation means that vector (SIMD) instructions are automatically used instead of scalar instructions, so that, for example, 4 or 8 multiplications can be done with one instruction.
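To make that concrete, here is a hand-written C++ sketch of the transformation (the function names are mine, purely for illustration): the scalar loop does one multiplication per iteration, while the AVX version does 8 with a single instruction, which is essentially what an auto-vectoriser produces for you.

```cpp
#include <cstddef>
#include <immintrin.h>

// Scalar version: one multiplication per loop iteration.
void square_scalar(const float* in, float* out, std::size_t n)
{
    for(std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}

// Vectorised version: one 256-bit AVX multiply handles 8 floats at once.
// Assumes n is a multiple of 8 and the pointers are 32-byte aligned.
void square_avx(const float* in, float* out, std::size_t n)
{
    for(std::size_t i = 0; i < n; i += 8)
    {
        const __m256 x = _mm256_load_ps(in + i);       // load 8 floats
        _mm256_store_ps(out + i, _mm256_mul_ps(x, x)); // square and store 8 floats
    }
}
```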
Taking some Winter code, which applies a function to an input array, generating a result array:
```
def square(float x) float : x*x

def main(array<float, 268435456> a, array<float, 268435456> b) array<float, 268435456> :
    map(square, a)
```
We're using a large array size here so we get a decent, measurable amount of work to do.
The assembly from the inner loop looks like this:
```
.LBB2_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovaps	(%rdx,%rax), %ymm0
	vmovaps	32(%rdx,%rax), %ymm1
	vmovaps	64(%rdx,%rax), %ymm2
	vmovaps	96(%rdx,%rax), %ymm3
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vmulps	%ymm2, %ymm2, %ymm2
	vmulps	%ymm3, %ymm3, %ymm3
	vmovaps	%ymm0, (%rcx,%rax)
	vmovaps	%ymm1, 32(%rcx,%rax)
	vmovaps	%ymm2, 64(%rcx,%rax)
	vmovaps	%ymm3, 96(%rcx,%rax)
	subq	$-128, %rax
	addq	$-32, %r10
	jne	.LBB2_3
```

Very nice code here: LLVM is doing a good job. You can see the loop has been unrolled four times, so each iteration does 4 256-bit memory loads, then 4 256-bit 8-element multiplications, then 4 stores.
This code runs at memory speed, which is about 5.7 GB/s in this case on my machine. Even a much sloppier inner loop will still run at memory speed when there is this little work to do per element.
Resulting perf measurements are:
```
JITed code elapsed: 0.187149 s
JITed bandwidth: 5.73737 GiB/s
C++ ref elapsed: 0.196158 s
C++ ref bandwidth: 5.47385 GiB/s
```
Both the JITed Winter code and the C++ code are memory-limited, so they run at similar speeds.
Taking another example, where more work is done per array element:
```
def f(float x) float : pow(x, 2.2)

def main(array<float, 268435456> a, array<float, 268435456> b) array<float, 268435456> :
    map(f, a)
```
pow() is a pretty slow function (100 cycles or more), which makes this benchmark compute-bound rather than memory-bound, and so lets us see the influence of auto-parallelisation.
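For reference, the work being benchmarked is equivalent to this plain C++ loop (my reconstruction for illustration; the actual reference implementation isn't shown in this post):

```cpp
#include <cmath>
#include <cstddef>

// Equivalent of map(f, a): raise each element to the power 2.2.
void apply_pow(const float* in, float* out, std::size_t n)
{
    for(std::size_t i = 0; i < n; ++i)
        out[i] = std::pow(in[i], 2.2f);
}
```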
Winter auto-parallelises this map function, dividing the array into slices, and passing each slice to a worker thread, on which the actual work is done.
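The post doesn't show the runtime machinery, but a minimal C++ sketch of that slicing strategy could look like this (the names and slicing policy are my own assumptions, not Winter's actual implementation):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Apply f to every element of 'in', writing results to 'out', with the
// index range divided into contiguous slices, one per worker thread.
// A minimal sketch: a real runtime would reuse a thread pool rather
// than spawning threads on every call.
template <typename Func>
void parallel_map(const float* in, float* out, std::size_t n, Func f,
                  unsigned num_threads = std::thread::hardware_concurrency())
{
    if(num_threads == 0)
        num_threads = 1; // hardware_concurrency() may return 0

    const std::size_t slice_size = n / num_threads;
    std::vector<std::thread> threads;
    for(unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = t * slice_size;
        // The last thread also handles any remainder elements.
        const std::size_t end = (t == num_threads - 1) ? n : begin + slice_size;
        threads.emplace_back([=]{
            for(std::size_t i = begin; i < end; ++i)
                out[i] = f(in[i]);
        });
    }
    for(std::thread& th : threads)
        th.join();
}
```

Each worker writes to a disjoint slice of the output array, so no locking is needed.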
With auto-parallelisation disabled, results are as follows:
pow(x, 2.2), single threaded map():

```
JITed code elapsed: 6.32286 s
JITed bandwidth: 0.169819 GiB/s
C++ ref elapsed: 1.90434 s
C++ ref bandwidth: 0.563838 GiB/s
```
The reason the C++ version is faster is that Visual C++ vectorises the pow() function, whereas LLVM doesn't (I just filed a bug report on this here).
With auto-parallelisation enabled:
pow(x, 2.2), multi threaded map():

```
JITed code elapsed: 1.35347 s
JITed bandwidth: 0.793327 GiB/s
C++ ref elapsed: 1.94178 s
C++ ref bandwidth: 0.552969 GiB/s
```
You can see from the JITed results that the program now runs about 4.67x faster (6.32 s down to 1.35 s), which is roughly what you would expect using 8 threads on 8 logical cores, considering that 4 of them are hyperthreading cores.
No changes in the Winter code were required to use this parallelisation; it's completely automatic.
So there you have it - a proof-of-concept of auto-vectorisation and auto-parallelisation in Winter. The code is not robust yet - I just got it working this weekend. But I think it's quite a promising result. It also shows, I think, that functional programming languages can be just as fast as (or faster than) optimised, multithreaded C++ code.
Edit: Discussion on reddit.