Jun 9, 2011

SLEEF, an excursion on Go's math

A benchmark led to a contribution, and a discovery of sincos_amd64.s, further led to a pleasant sunny Sunday afternoon, reading the SLEEF paper. Thank you Mr. Naoki Shibata for your kind email.

SLEEF has 128-bit SSE optimized code to obtain sin and cos at the same time, but sincos_amd64.s only uses 64-bit FPU, in effect it is just slightly optimized from SLEEF reference code in C.

I was not convinced the benefit of turning an optimized C to a handcraft assembly, merely to streamline the branches. Go should not be slow either. We just need numbers to say for itself. 

A straightforward translation and benchmark shown 4x difference on my Mac Mini.

sleef.BenchmarkSincos  10000000        169 ns/op
sleef.BenchmarkMathSincos 50000000         39 ns/op

Turn math.Pow and div to mul, and inline the function, might speed it up. Let's try again:

sleef.BenchmarkSincos  50000000        51 ns/op
sleef.BenchmarkMathSincos 50000000         38 ns/op

The benefit is not so convincing to have an architecture dependent optimization. Compiler does a fairly good job already, and most importantly, it is trustworthy, and portable. Optimization without a benchmark always needs to be justified.

How is that performed on ARM?

Misfortune struck. Just then I plugged out LAN cable from the pandaboard for Mac. When I plugged it back, it kept telling me files were read-only. Reboot without a proper shutdown (cannot as file system is readonly), rendered a corrupted EXT3-fs on my mmcblk0p2. All my math.Sqrt work were there, gone, luckily up to the cloud already.

DONT PANIC. Let's try e2fsck /dev/mmcblk0p2 first. Wow, after some clearing and fixing, SD is back to business.

The baseline, with Pow and Div.

sleef.BenchmarkSincos    1000000              2490 ns/op
sleef.BenchmarkMathSincos        5000000               442 ns/op

The optimized version, much faster, comparable to existing sin.go algorithm.

sleef.BenchmarkSincos    5000000               400 ns/op
sleef.BenchmarkMathSincos        5000000               443 ns/op

Worth converting to SLEEF? It depends on another key factor - accuracy. That is the whole point of SLEEF -  minimize the cancellation error to improve the accuracy, as demonstrated in the paper.

That accuracy holds for sincos, not exp. I had to reduce the tolerance from 1e-14 to 1e-6 to pass gotest. Exp.go is also slower than exp_386.s on my Linux 8g:

sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op

But faster than expGo on my ARM 5g:

sleef.BenchmarkExp      10000000               204 ns/op  
sleef.BenchmarkMathExp   5000000               525 ns/op 

It could be faster, as 5g does not do a good job at const folding yet, as demonstrated by NOT using T1 to T4 and calculating them in place instead:

sleef.BenchmarkSincos    1000000              1565 ns/op 
sleef.BenchmarkMathSincos        5000000               442 ns/op  

Enough benchmarking. My head is free-wheeling already. Go get a shot of Espresso and rest in pea's now!

Ok. Last word to the world, the full set of benchmark for SLEEF sincos, asin/acos, exp and log, on Intel 32bit ubuntu , 64bit Mac OSX and ARM Cortex-A9.

Linux on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz, bogomips : 3990.01

sleef.BenchmarkSincos 10000000       285 ns/op
sleef.BenchmarkMathSincos 50000000        55 ns/op
sleef.BenchmarkAsin  500000      3161 ns/op
sleef.BenchmarkMathAsin 20000000       103 ns/op
sleef.BenchmarkAcos 1000000      2017 ns/op
sleef.BenchmarkMathAcos 20000000       105 ns/op
sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op
sleef.BenchmarkLog 10000000       279 ns/op
sleef.BenchmarkMathLog 50000000        31 ns/op

Linux on ARMv7 OMAP4 Pandaboard BogoMIPS        : 2009.29
sleef.BenchmarkSincos    5000000               400 ns/op                        
sleef.BenchmarkMathSincos        5000000               443 ns/op                
sleef.BenchmarkAsin       200000             12983 ns/op                        
sleef.BenchmarkMathAsin  1000000              1672 ns/op                        
sleef.BenchmarkAcos       200000             12890 ns/op                        
sleef.BenchmarkMathAcos  1000000              1705 ns/op                        
sleef.BenchmarkExp      10000000               204 ns/op                        
sleef.BenchmarkMathExp   5000000               531 ns/op                        
sleef.BenchmarkLog       5000000               662 ns/op                        
sleef.BenchmarkMathLog   5000000               418 ns/op

Mac OSX on 2.4Ghz Intel Core 2 Duo

sleef.BenchmarkSincos    5000000               51 ns/op                        
sleef.BenchmarkMathSincos        5000000               38 ns/op                
sleef.BenchmarkAsin       200000             1122 ns/op                        
sleef.BenchmarkMathAsin  1000000              68 ns/op                        
sleef.BenchmarkAcos       200000             1107 ns/op                        
sleef.BenchmarkMathAcos  1000000              75 ns/op                        
sleef.BenchmarkExp      10000000               51 ns/op                        
sleef.BenchmarkMathExp   5000000               29 ns/op                        
sleef.BenchmarkLog       5000000               149 ns/op                        
sleef.BenchmarkMathLog   5000000               26 ns/op