Jun 9, 2011

SLEEF, an excursion into Go's math

A benchmark led to a contribution and to the discovery of sincos_amd64.s, which in turn led to a pleasant, sunny Sunday afternoon reading the SLEEF paper. Thank you, Mr. Naoki Shibata, for your kind email.

SLEEF has 128-bit SSE-optimized code to obtain sin and cos at the same time, but sincos_amd64.s only uses the 64-bit FPU; in effect it is just a slightly optimized version of the SLEEF reference C code.
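
For reference, the Go API in question computes both values in a single call:

package main

import (
    "fmt"
    "math"
)

func main() {
    sin, cos := math.Sincos(math.Pi / 3) // one call, both results
    fmt.Println(sin, cos)                // ≈ 0.866 0.5
}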

I was not convinced of the benefit of turning optimized C into handcrafted assembly merely to streamline the branches. Go should not be slow either. We just need the numbers to speak for themselves.

A straightforward translation and benchmark showed a 4x difference on my Mac Mini.

sleef.BenchmarkSincos  10000000        169 ns/op
sleef.BenchmarkMathSincos 50000000         39 ns/op
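
For context, both numbers come from the usual gotest benchmark pattern. A minimal sketch, where Sincos stands in for my Go translation of the SLEEF routine:

package sleef

import (
    "math"
    "testing"
)

// Times the Go translation of the SLEEF routine (Sincos is
// assumed to be defined in this package).
func BenchmarkSincos(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Sincos(0.5)
    }
}

// Times the standard library version for comparison.
func BenchmarkMathSincos(b *testing.B) {
    for i := 0; i < b.N; i++ {
        math.Sincos(0.5)
    }
}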

Turning math.Pow and div into mul, and inlining the function, might speed it up. Let's try again:

sleef.BenchmarkSincos  50000000        51 ns/op
sleef.BenchmarkMathSincos 50000000         38 ns/op
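
The rewrite is plain strength reduction. A hypothetical polynomial term shows its shape (the real SLEEF coefficients differ):

package main

import (
    "fmt"
    "math"
)

// The slow way: a math.Pow call plus a division on every evaluation.
func termSlow(x float64) float64 {
    return math.Pow(x, 3) / 6
}

// Pow and div turned into mul: the reciprocal is a compile-time
// constant, leaving only three multiplications.
func termFast(x float64) float64 {
    return x * x * x * (1.0 / 6)
}

func main() {
    fmt.Println(termSlow(0.5), termFast(0.5))
}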

The benefit is not convincing enough to warrant an architecture-dependent optimization. The compiler does a fairly good job already, and most importantly, its output is trustworthy and portable. An optimization always needs a benchmark to justify it.

How does it perform on ARM?

Misfortune struck. Just then I unplugged the LAN cable from the Pandaboard to use it with the Mac. When I plugged it back in, the board kept telling me the files were read-only. Rebooting without a proper shutdown (impossible, as the file system was read-only) left a corrupted EXT3 file system on my mmcblk0p2. All my math.Sqrt work was there, gone; luckily it was already up in the cloud.

DON'T PANIC. Let's try e2fsck /dev/mmcblk0p2 first. Wow, after some clearing and fixing, the SD card was back in business.

The baseline, with Pow and div:

sleef.BenchmarkSincos    1000000              2490 ns/op
sleef.BenchmarkMathSincos        5000000               442 ns/op

The optimized version is much faster, comparable to the existing sin.go algorithm:

sleef.BenchmarkSincos    5000000               400 ns/op
sleef.BenchmarkMathSincos        5000000               443 ns/op

Is it worth converting to SLEEF? That depends on another key factor: accuracy. That is the whole point of SLEEF: minimizing cancellation error to improve accuracy, as demonstrated in the paper.
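
The core idea is to split a reduction constant into pieces so that the subtractions cancel exactly. Here is my own tiny illustration of the trick, not SLEEF's actual constants:

package main

import (
    "fmt"
    "math"
)

// split breaks a constant into a "high" part with the low mantissa
// bits zeroed and a "low" remainder. k*hi is then exact for small
// integer k, so x - k*hi subtracts without rounding error, and k*lo
// restores the lost low-order bits afterwards.
func split(c float64) (hi, lo float64) {
    hi = math.Float64frombits(math.Float64bits(c) &^ ((1 << 32) - 1))
    lo = c - hi
    return
}

func main() {
    hi, lo := split(math.Pi / 4)
    x, k := 100.0, 127.0 // reduce x by 127 eighth-turns
    r := x - k*hi        // exact: k*hi needs few mantissa bits
    r -= k * lo          // add back the correction
    fmt.Println(r)
}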

That accuracy holds for sincos, but not for exp: I had to loosen the tolerance from 1e-14 to 1e-6 to pass gotest. The SLEEF exp is also slower than exp_386.s on my Linux 8g:

sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op

But faster than expGo on my ARM 5g:

sleef.BenchmarkExp      10000000               204 ns/op  
sleef.BenchmarkMathExp   5000000               525 ns/op 
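
As an aside, the accuracy check behind that tolerance is just a relative-error comparison. A minimal sketch, with a hypothetical within helper and assuming this package exposes Exp:

package sleef

import (
    "math"
    "testing"
)

// within reports whether got is within tol of want, relative to want.
func within(got, want, tol float64) bool {
    if want == 0 {
        return math.Abs(got) < tol
    }
    return math.Abs((got-want)/want) < tol
}

func TestExp(t *testing.T) {
    const tol = 1e-6 // loosened from 1e-14 for the SLEEF exp
    for x := -10.0; x <= 10; x += 0.25 {
        if got, want := Exp(x), math.Exp(x); !within(got, want, tol) {
            t.Errorf("Exp(%g) = %g, want %g", x, got, want)
        }
    }
}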

The SLEEF code on ARM could be faster still, because 5g does not do a good job at constant folding yet. This is demonstrated by NOT using the precomputed T1 to T4 and calculating them in place instead, which slows sincos right back down:

sleef.BenchmarkSincos    1000000              1565 ns/op 
sleef.BenchmarkMathSincos        5000000               442 ns/op  
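
To see why, compare a coefficient precomputed as a literal, which costs nothing at runtime, with one computed in place through a function call, which the compiler cannot fold at all. A hypothetical illustration (the real T1 to T4 come from the SLEEF code):

package main

import (
    "fmt"
    "math"
)

// Precomputed by hand: the compiled code simply loads the literal.
const t1 = 0.6931471805599453 // ln 2

// Computed in place: the function call runs on every invocation,
// and the 5g of 2011 was weak even at folding plain constant
// arithmetic.
func t1InPlace() float64 {
    return math.Log(2)
}

func main() {
    fmt.Println(t1, t1InPlace())
}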

Enough benchmarking. My head is freewheeling already. Time to get a shot of espresso and rest in peace now!

OK. One last word to the world: the full set of benchmarks for SLEEF sincos, asin/acos, exp and log, on 32-bit Intel Ubuntu, 64-bit Mac OS X and an ARM Cortex-A9.

Linux on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz, bogomips : 3990.01

sleef.BenchmarkSincos 10000000       285 ns/op
sleef.BenchmarkMathSincos 50000000        55 ns/op
sleef.BenchmarkAsin  500000      3161 ns/op
sleef.BenchmarkMathAsin 20000000       103 ns/op
sleef.BenchmarkAcos 1000000      2017 ns/op
sleef.BenchmarkMathAcos 20000000       105 ns/op
sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op
sleef.BenchmarkLog 10000000       279 ns/op
sleef.BenchmarkMathLog 50000000        31 ns/op

Linux on ARMv7 OMAP4 Pandaboard, BogoMIPS : 2009.29

sleef.BenchmarkSincos    5000000               400 ns/op                        
sleef.BenchmarkMathSincos        5000000               443 ns/op                
sleef.BenchmarkAsin       200000             12983 ns/op                        
sleef.BenchmarkMathAsin  1000000              1672 ns/op                        
sleef.BenchmarkAcos       200000             12890 ns/op                        
sleef.BenchmarkMathAcos  1000000              1705 ns/op                        
sleef.BenchmarkExp      10000000               204 ns/op                        
sleef.BenchmarkMathExp   5000000               531 ns/op                        
sleef.BenchmarkLog       5000000               662 ns/op                        
sleef.BenchmarkMathLog   5000000               418 ns/op


Mac OS X on a 2.4GHz Intel Core 2 Duo

sleef.BenchmarkSincos    5000000               51 ns/op                        
sleef.BenchmarkMathSincos        5000000               38 ns/op                
sleef.BenchmarkAsin       200000             1122 ns/op                        
sleef.BenchmarkMathAsin  1000000              68 ns/op                        
sleef.BenchmarkAcos       200000             1107 ns/op                        
sleef.BenchmarkMathAcos  1000000              75 ns/op                        
sleef.BenchmarkExp      10000000               51 ns/op                        
sleef.BenchmarkMathExp   5000000               29 ns/op                        
sleef.BenchmarkLog       5000000               149 ns/op                        
sleef.BenchmarkMathLog   5000000               26 ns/op

Jun 2, 2011

Easy Going with ARM, square root up [3]

In my last post, we noticed that gc is still nearly twice as slow as gcc on the nbody benchmark. So, why? Let's put the simplest possible program under test.

go@localhost:~/go/ex$ cat f.c
#include <math.h>

double mysqrt(double a) { return sqrt(a); }

int main() { mysqrt(0.1); return 0; }

go@localhost:~/go/ex$ cat f.go
package main

import "math"

func mysqrt(a float64) float64 { return math.Sqrt(a) }

func main() { mysqrt(0.1) }

Now compare gcc's disassembly

000083e0 <mysqrt>:
push    {r7, lr}
sub     sp, #8
add     r7, sp, #0
strd    r0, r1, [r7]
vldr    d6, [r7]
vsqrt.f64       d7, d6
vcmp.f64        d7, d7
vmrs    APSR_nzcv, fpscr
beq.n   8408 <mysqrt+0x28>
vmov    r0, r1, d6
blx     8354 <_init+0x48>
vmov    d7, r0, r1
vmov    r2, r3, d7
mov     r0, r2
mov     r1, r3
add.w   r7, r7, #8
mov     sp, r7
pop     {r7, pc}

00008418 <main>:
push    {r7, lr}
add     r7, sp, #0
add     r1, pc, #16 
ldrd    r0, r1, [r1]
bl      83e0 <mysqrt>
mov     r0, r3
pop     {r7, pc}

to gc's:

00010c00 <main.mysqrt>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #16
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
vldr    d0, [sp, #24]
vstr    d0, [sp, #4]
bl      280e4 <math.sqrt>
vldr    d0, [sp, #12]
vstr    d0, [sp, #32]
ldr     pc, [sp], #20

00010c34 <main.main>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #0
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
ldr     fp, [pc, #16] 
vldr    d0, [fp]
vstr    d0, [sp, #4]
bl      10c00 <main.mysqrt>
ldr     pc, [sp], #20
b       10c64 <main.main+0x30>

Russ is correct. Gcc uses an intrinsic to emit the VSQRT instruction directly into the code stream, while gc uses a function call, which is expensive. Go will have function inlining, though it is not available yet. But look at the A9 NEON MPE reference manual: VSQRT takes 32 cycles (VDIV takes 25; most others, including VMUL, take just 1 or 2), so the saving from function inlining may not help much.

Of course, hardware SQRT alone already speeds up the nbody benchmark by 7 times. ARM VFP also has ABS and NEG instructions. They look trivial at first glance, because all they do is clear, or flip, the MSB (the sign bit) of the operand register. Couldn't that be done with a simple AND or XOR (sketched in Go below)? Yes. But we also need to know:
  1. The MPE has separate VFP and NEON (SIMD) blocks that share the same register file of 32 sixty-four-bit registers.
  2. VFP has no logical operations.
  3. NEON does, but switching between VFP and NEON on the A9 is expensive.
  4. Moving values back and forth between VFP and ARM core registers stalls both pipelines, and should not be done in a tight loop.
Reasons No. 2 and 4 explain the rationale for VABS and VNEG, though not a strong one. I am happy to live without them.
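
In Go source, the same bit tricks are easy to express. A minimal sketch of sign-bit Abs and Neg:

package main

import (
    "fmt"
    "math"
)

// abs clears the sign bit of the float64 bit pattern (an AND NOT).
func abs(x float64) float64 {
    return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

// neg flips the sign bit (an XOR).
func neg(x float64) float64 {
    return math.Float64frombits(math.Float64bits(x) ^ (1 << 63))
}

func main() {
    fmt.Println(abs(-1.5), neg(2.5)) // 1.5 -2.5
}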

Reason No. 3 is critical. An FPU is effectively mandatory for C and Go compilers (software emulation exists, in both the compilers and the Linux kernel, to help low-end chips without a hardware floating-point unit), but SIMD lives in the land of handcrafted code: it speeds up certain special operations, which dedicated hardware logic tends to do a much better job of anyway, often rendering SIMD a white elephant in the end. For example, NVidia is frequently challenged on why it did not implement NEON in its cutting-edge Tegra 2; the usual answer is that with dedicated audio/video codecs and a 2D/3D GPU, they see no immediate need for NEON. Most other chips, including NVidia's next-generation Tegra, have NEON, for programmers to waste time on.

Back to package math. I would like to port sincos_amd64.s to ARM VFP. That function can serve as the basis for other trigonometric functions. The algorithm, SLEEF, is intended for SIMD, but this implementation is not; it is a slightly optimized translation from C, so it would be straightforward to make a Go copy. But, as its comments note, a VCMP would save a branch, always a win on a modern, deeply pipelined CPU.

What does that mean? Let's dive into assembly again. A Go program like this:

package main

func main() {
    a, b := 1.0, 2.0
    if a > b {
        b = a
    }
}

compiles with 5g -S to this:

--- prog list "main" ---
0000 (f.go:3) TEXT   main+0(SB),R0,$32-0
0001 (f.go:4) MOVD   $(1.00000000000000000e+00),F4
0002 (f.go:4) MOVD   $(2.00000000000000000e+00),F3
0003 (f.go:5) B       ,5(APC)
0004 (f.go:5) B       ,16(APC)
0005 (f.go:5) B       ,7(APC)
0006 (f.go:5) B       ,14(APC)
0007 (f.go:5) MOVD   F4,F0
0008 (f.go:5) MOVD   F4,F2
0009 (f.go:5) MOVD   F3,F1
0010 (f.go:5) CMPD   F3,F4,
0011 (f.go:5) BVS     ,13(APC)
0012 (f.go:5) BGT     ,6(APC)
0013 (f.go:5) B       ,4(APC)
0014 (f.go:6) MOVD   F2,F0
0015 (f.go:5) B       ,16(APC)
0016 (f.go:8) RET     ,

Lots of B instructions for branches. Modern CPUs spend a lot of silicon on predicting branches, to squeeze out the pipeline bubbles caused by misprediction. But if we work in assembly, the code can be optimized considerably with conditional instructions. That could be a big deal for something like math.Sincos, because it is often used in tight loops for thousands of calculations.

Looking forward to another commit of source-icide soon.