The weblog of Nicholas Chapman

Auto-vectorisation and parallelisation in Winter

Posted 24 May 2015

I have reached an interesting milestone in our functional programming language Winter: Winter can now do auto-parallelisation, auto-vectorisation, and even both at the same time :)

Auto-parallelisation means that code is automatically executed in parallel on multiple threads. Auto-vectorisation means that vector (SIMD) instructions are automatically used instead of scalar instructions, so that e.g. 4 or 8 multiplications can be done with one instruction.

Take some Winter code which applies a function to an input array, generating a result array:

```
def square(float x) float : x*x

def main(array a, array b) array : map(square, a)
```

We're using a large array size here so we get a decent, measurable amount of work to do. The assembly from the inner loop looks like this:

```asm
.LBB2_3:                        # %vector.body
                                # =>This Inner Loop Header: Depth=1
	vmovaps	(%rdx,%rax), %ymm0
	vmovaps	32(%rdx,%rax), %ymm1
	vmovaps	64(%rdx,%rax), %ymm2
	vmovaps	96(%rdx,%rax), %ymm3
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vmulps	%ymm2, %ymm2, %ymm2
	vmulps	%ymm3, %ymm3, %ymm3
	vmovaps	%ymm0, (%rcx,%rax)
	vmovaps	%ymm1, 32(%rcx,%rax)
	vmovaps	%ymm2, 64(%rcx,%rax)
	vmovaps	%ymm3, 96(%rcx,%rax)
	subq	$-128, %rax
	addq	$-32, %r10
	jne	.LBB2_3
```

This is very nice code; LLVM is doing a good job. You can see there are 4 256-bit memory loads being done in parallel, then 4 256-bit, 8-element multiplications being done in parallel, then 4 stores. This code runs at memory speed, which is about 5.7 GiB/s in this case on my machine. Even a much sloppier inner loop will still run at memory speed when there is this little work to do per element. The resulting perf measurements are:

```
JITed code elapsed:   0.187149 s
JITed bandwidth:      5.73737 GiB/s
C++ ref elapsed:      0.196158 s
C++ ref bandwidth:    5.47385 GiB/s
```

Both the JITed Winter code and the C++ code are memory-limited, so they run at similar speeds.
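For reference, the C++ analogue of the `map(square, a)` call is just an elementwise loop like the sketch below. This is not the post's actual benchmark code (the function name `map_square` and the raw-pointer interface are illustrative); it only shows the kind of dependence-free scalar loop that LLVM's loop vectoriser typically turns into AVX code like the listing above when optimisation is enabled (e.g. `clang -O2 -mavx`):

```cpp
#include <cstddef>

// Scalar reference implementation of map(square, a).
static inline float square(float x) { return x * x; }

void map_square(const float* a, float* out, std::size_t n)
{
    // Each iteration is independent, with no aliasing tricks and a
    // simple stride-1 access pattern: ideal input for auto-vectorisation.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = square(a[i]);
}
```

The unrolling by four and the 32-byte aligned `vmovaps` loads in the assembly above are the vectoriser's own choices; the source loop stays completely scalar.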
Take another example, where more work is done per array element:

```
def f(float x) float : pow(x, 2.2)

def main(array a, array b) array : map(f, a)
```

pow() is a pretty slow function (100 cycles or more), which lets us see the influence of auto-parallelisation. Winter auto-parallelises this map call, dividing the array into slices and passing each slice to a worker thread, on which the actual work is done.

With auto-parallelisation disabled, the results are as follows:

```
pow(x, 2.2), single-threaded map():
JITed code elapsed:   6.32286 s
JITed bandwidth:      0.169819 GiB/s
C++ ref elapsed:      1.90434 s
C++ ref bandwidth:    0.563838 GiB/s
```

The C++ version is faster because Visual C++ vectorises the pow() function, whereas LLVM doesn't (I just filed a bug report on this here).

With auto-parallelisation enabled:

```
pow(x, 2.2), multi-threaded map():
JITed code elapsed:   1.35347 s
JITed bandwidth:      0.793327 GiB/s
C++ ref elapsed:      1.94178 s
C++ ref bandwidth:    0.552969 GiB/s
```

You can see from the highlighted result that the program runs about 4.67x faster (6.32286 s down to 1.35347 s), which is roughly what you would expect using 8 threads on 8 logical cores, considering that 4 of them are hyperthreading cores. No changes to the Winter code were required to use this parallelisation; it's completely automatic.

So there you have it: a proof of concept of auto-vectorisation and auto-parallelisation in Winter. The code is not robust yet (I just got it working this weekend), but I think it's quite a promising result. It also shows, I think, that functional programming languages can be just as fast as (or faster than) optimised, multithreaded C++ code.

Edit: Discussion on reddit.
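Winter's actual parallel runtime isn't shown in the post, but the slice-per-worker-thread scheme it describes can be sketched in C++ as follows. All names here (`parallel_map`, the slice arithmetic, the thread count) are illustrative assumptions, not Winter's implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// The per-element function from the example: pow(x, 2.2).
static float f(float x) { return std::pow(x, 2.2f); }

// Divide [0, n) into contiguous slices and process each slice on its
// own worker thread, mirroring the auto-parallelised map() described above.
void parallel_map(const float* a, float* out, std::size_t n,
                  unsigned num_threads = std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 1;
    const std::size_t slice = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * slice;
        const std::size_t end = std::min(n, begin + slice);
        if (begin >= end) break; // no work left for this thread
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = f(a[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

Because the slices are disjoint, the worker threads need no synchronisation beyond the final join, which is why an expensive per-element function like pow() scales close to the core count.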