SLEEF has 128-bit SSE optimized code to obtain sin and cos at the same time, but sincos_amd64.s only uses 64-bit FPU, in effect it is just slightly optimized from SLEEF reference code in C.
I was not convinced the benefit of turning an optimized C to a handcraft assembly, merely to streamline the branches. Go should not be slow either. We just need numbers to say for itself.
A straightforward translation and benchmark shown 4x difference on my Mac Mini.
sleef.BenchmarkSincos 10000000 169 ns/op
sleef.BenchmarkMathSincos 50000000 39 ns/op
Turn math.Pow and div to mul, and inline the function, might speed it up. Let's try again:
sleef.BenchmarkSincos 50000000 51 ns/op
sleef.BenchmarkMathSincos 50000000 38 ns/op
The benefit is not so convincing to have an architecture dependent optimization. Compiler does a fairly good job already, and most importantly, it is trustworthy, and portable. Optimization without a benchmark always needs to be justified.
How is that performed on ARM?
Misfortune struck. Just then I plugged out LAN cable from the pandaboard for Mac. When I plugged it back, it kept telling me files were read-only. Reboot without a proper shutdown (cannot as file system is readonly), rendered a corrupted EXT3-fs on my mmcblk0p2. All my math.Sqrt work were there, gone, luckily up to the cloud already.
DONT PANIC. Let's try e2fsck /dev/mmcblk0p2 first. Wow, after some clearing and fixing, SD is back to business.
The baseline, with Pow and Div.
sleef.BenchmarkSincos 1000000 2490 ns/op
sleef.BenchmarkMathSincos 5000000 442 ns/op
The optimized version, much faster, comparable to existing sin.go algorithm.
sleef.BenchmarkSincos 5000000 400 ns/op
sleef.BenchmarkMathSincos 5000000 443 ns/op
Worth converting to SLEEF? It depends on another key factor - accuracy. That is the whole point of SLEEF - minimize the cancellation error to improve the accuracy, as demonstrated in the paper.
That accuracy holds for sincos, not exp. I had to reduce the tolerance from 1e-14 to 1e-6 to pass gotest. Exp.go is also slower than exp_386.s on my Linux 8g:
sleef.BenchmarkExp 20000000 117 ns/op
sleef.BenchmarkMathExp 50000000 69 ns/op
Ok. Last word to the world, the full set of benchmark for SLEEF sincos, asin/acos, exp and log, on Intel 32bit ubuntu , 64bit Mac OSX and ARM Cortex-A9.
Linux on Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz, bogomips : 3990.01
sleef.BenchmarkSincos 10000000 285 ns/op
sleef.BenchmarkMathSincos 50000000 55 ns/op
sleef.BenchmarkAsin 500000 3161 ns/op
sleef.BenchmarkMathAsin 20000000 103 ns/op
sleef.BenchmarkAcos 1000000 2017 ns/op
sleef.BenchmarkMathAcos 20000000 105 ns/op
sleef.BenchmarkExp 20000000 117 ns/op
sleef.BenchmarkMathExp 50000000 69 ns/op
sleef.BenchmarkLog 10000000 279 ns/op
sleef.BenchmarkMathLog 50000000 31 ns/op
Mac OSX on 2.4Ghz Intel Core 2 Duo
That accuracy holds for sincos, not exp. I had to reduce the tolerance from 1e-14 to 1e-6 to pass gotest. Exp.go is also slower than exp_386.s on my Linux 8g:
sleef.BenchmarkExp 20000000 117 ns/op
sleef.BenchmarkMathExp 50000000 69 ns/op
But faster than expGo on my ARM 5g:
sleef.BenchmarkExp 10000000 204 ns/op
sleef.BenchmarkMathExp 5000000 525 ns/op
It could be faster, as 5g does not do a good job at const folding yet, as demonstrated by NOT using T1 to T4 and calculating them in place instead:
sleef.BenchmarkSincos 1000000 1565 ns/op
sleef.BenchmarkMathSincos 5000000 442 ns/op
Enough benchmarking. My head is free-wheeling already. Go get a shot of Espresso and rest in pea's now!
Ok. Last word to the world, the full set of benchmark for SLEEF sincos, asin/acos, exp and log, on Intel 32bit ubuntu , 64bit Mac OSX and ARM Cortex-A9.
Linux on Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz, bogomips : 3990.01
sleef.BenchmarkSincos 10000000 285 ns/op
sleef.BenchmarkMathSincos 50000000 55 ns/op
sleef.BenchmarkAsin 500000 3161 ns/op
sleef.BenchmarkMathAsin 20000000 103 ns/op
sleef.BenchmarkAcos 1000000 2017 ns/op
sleef.BenchmarkMathAcos 20000000 105 ns/op
sleef.BenchmarkExp 20000000 117 ns/op
sleef.BenchmarkMathExp 50000000 69 ns/op
sleef.BenchmarkLog 10000000 279 ns/op
sleef.BenchmarkMathLog 50000000 31 ns/op
Linux on ARMv7 OMAP4 Pandaboard BogoMIPS : 2009.29
sleef.BenchmarkSincos 5000000 400 ns/op
sleef.BenchmarkMathSincos 5000000 443 ns/op
sleef.BenchmarkAsin 200000 12983 ns/op
sleef.BenchmarkMathAsin 1000000 1672 ns/op
sleef.BenchmarkAcos 200000 12890 ns/op
sleef.BenchmarkMathAcos 1000000 1705 ns/op
sleef.BenchmarkExp 10000000 204 ns/op
sleef.BenchmarkMathExp 5000000 531 ns/op
sleef.BenchmarkLog 5000000 662 ns/op
sleef.BenchmarkMathLog 5000000 418 ns/op
Mac OSX on 2.4Ghz Intel Core 2 Duo
sleef.BenchmarkSincos 5000000 51 ns/op
sleef.BenchmarkMathSincos 5000000 38 ns/op
sleef.BenchmarkAsin 200000 1122 ns/op
sleef.BenchmarkMathAsin 1000000 68 ns/op
sleef.BenchmarkAcos 200000 1107 ns/op
sleef.BenchmarkMathAcos 1000000 75 ns/op
sleef.BenchmarkExp 10000000 51 ns/op
sleef.BenchmarkMathExp 5000000 29 ns/op
sleef.BenchmarkLog 5000000 149 ns/op
sleef.BenchmarkMathLog 5000000 26 ns/op