Jun 9, 2011

SLEEF, an excursion into Go's math

A benchmark led to a contribution and to the discovery of sincos_amd64.s, which in turn led to a pleasant, sunny Sunday afternoon reading the SLEEF paper. Thank you, Mr. Naoki Shibata, for your kind email.

SLEEF has 128-bit SSE-optimized code to obtain sin and cos at the same time, but sincos_amd64.s only uses the 64-bit FPU; in effect it is just a slightly optimized version of the SLEEF reference C code.
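
For reference, the Go API in question computes both values in a single call:

package main

import (
    "fmt"
    "math"
)

func main() {
    sin, cos := math.Sincos(math.Pi / 3) // one call, both results
    fmt.Println(sin, cos)                // ≈ 0.866 0.5
}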

I was not convinced of the benefit of turning optimized C into handcrafted assembly merely to streamline the branches. Go should not be slow either. We just need the numbers to speak for themselves.

A straightforward translation and benchmark showed a 4x difference on my Mac Mini.

sleef.BenchmarkSincos  10000000        169 ns/op
sleef.BenchmarkMathSincos 50000000         39 ns/op
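
For context, both numbers come from the usual gotest benchmark pattern. A minimal sketch, where Sincos stands in for my Go translation of the SLEEF routine:

package sleef

import (
    "math"
    "testing"
)

// Times the Go translation of the SLEEF routine (Sincos is
// assumed to be defined in this package).
func BenchmarkSincos(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Sincos(0.5)
    }
}

// Times the standard library version for comparison.
func BenchmarkMathSincos(b *testing.B) {
    for i := 0; i < b.N; i++ {
        math.Sincos(0.5)
    }
}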

Turning math.Pow and div into mul, and inlining the function, might speed it up. Let's try again:

sleef.BenchmarkSincos  50000000        51 ns/op
sleef.BenchmarkMathSincos 50000000         38 ns/op
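
The rewrite is plain strength reduction. A hypothetical polynomial term shows its shape (the real SLEEF coefficients differ):

package main

import (
    "fmt"
    "math"
)

// The slow way: a math.Pow call plus a division on every evaluation.
func termSlow(x float64) float64 {
    return math.Pow(x, 3) / 6
}

// Pow and div turned into mul: the reciprocal is a compile-time
// constant, leaving only three multiplications.
func termFast(x float64) float64 {
    return x * x * x * (1.0 / 6)
}

func main() {
    fmt.Println(termSlow(0.5), termFast(0.5))
}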

The benefit is not convincing enough to warrant an architecture-dependent optimization. The compiler does a fairly good job already, and most importantly, its output is trustworthy and portable. An optimization always needs a benchmark to justify it.

How does it perform on ARM?

Misfortune struck. Just then I unplugged the LAN cable from the Pandaboard to use it with the Mac. When I plugged it back in, the board kept telling me the files were read-only. Rebooting without a proper shutdown (impossible, as the file system was read-only) left a corrupted EXT3 file system on my mmcblk0p2. All my math.Sqrt work was there, gone; luckily it was already up in the cloud.

DON'T PANIC. Let's try e2fsck /dev/mmcblk0p2 first. Wow, after some clearing and fixing, the SD card was back in business.

The baseline, with Pow and div:

sleef.BenchmarkSincos    1000000              2490 ns/op
sleef.BenchmarkMathSincos        5000000               442 ns/op

The optimized version is much faster, comparable to the existing sin.go algorithm:

sleef.BenchmarkSincos    5000000               400 ns/op
sleef.BenchmarkMathSincos        5000000               443 ns/op

Is it worth converting to SLEEF? That depends on another key factor: accuracy. That is the whole point of SLEEF: minimizing cancellation error to improve accuracy, as demonstrated in the paper.
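
The core idea is to split a reduction constant into pieces so that the subtractions cancel exactly. Here is my own tiny illustration of the trick, not SLEEF's actual constants:

package main

import (
    "fmt"
    "math"
)

// split breaks a constant into a "high" part with the low mantissa
// bits zeroed and a "low" remainder. k*hi is then exact for small
// integer k, so x - k*hi subtracts without rounding error, and k*lo
// restores the lost low-order bits afterwards.
func split(c float64) (hi, lo float64) {
    hi = math.Float64frombits(math.Float64bits(c) &^ ((1 << 32) - 1))
    lo = c - hi
    return
}

func main() {
    hi, lo := split(math.Pi / 4)
    x, k := 100.0, 127.0 // reduce x by 127 eighth-turns
    r := x - k*hi        // exact: k*hi needs few mantissa bits
    r -= k * lo          // add back the correction
    fmt.Println(r)
}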

That accuracy holds for sincos, but not for exp: I had to loosen the tolerance from 1e-14 to 1e-6 to pass gotest. The SLEEF exp is also slower than exp_386.s on my Linux 8g:

sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op

But faster than expGo on my ARM 5g:

sleef.BenchmarkExp      10000000               204 ns/op  
sleef.BenchmarkMathExp   5000000               525 ns/op 
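
As an aside, the accuracy check behind that tolerance is just a relative-error comparison. A minimal sketch, with a hypothetical within helper and assuming this package exposes Exp:

package sleef

import (
    "math"
    "testing"
)

// within reports whether got is within tol of want, relative to want.
func within(got, want, tol float64) bool {
    if want == 0 {
        return math.Abs(got) < tol
    }
    return math.Abs((got-want)/want) < tol
}

func TestExp(t *testing.T) {
    const tol = 1e-6 // loosened from 1e-14 for the SLEEF exp
    for x := -10.0; x <= 10; x += 0.25 {
        if got, want := Exp(x), math.Exp(x); !within(got, want, tol) {
            t.Errorf("Exp(%g) = %g, want %g", x, got, want)
        }
    }
}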

The SLEEF code on ARM could be faster still, because 5g does not do a good job at constant folding yet. This is demonstrated by NOT using the precomputed T1 to T4 and calculating them in place instead, which slows sincos right back down:

sleef.BenchmarkSincos    1000000              1565 ns/op 
sleef.BenchmarkMathSincos        5000000               442 ns/op  
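
To see why, compare a coefficient precomputed as a literal, which costs nothing at runtime, with one computed in place through a function call, which the compiler cannot fold at all. A hypothetical illustration (the real T1 to T4 come from the SLEEF code):

package main

import (
    "fmt"
    "math"
)

// Precomputed by hand: the compiled code simply loads the literal.
const t1 = 0.6931471805599453 // ln 2

// Computed in place: the function call runs on every invocation,
// and the 5g of 2011 was weak even at folding plain constant
// arithmetic.
func t1InPlace() float64 {
    return math.Log(2)
}

func main() {
    fmt.Println(t1, t1InPlace())
}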

Enough benchmarking. My head is freewheeling already. Time to get a shot of espresso and rest in peace now!

OK. One last word to the world: the full set of benchmarks for SLEEF sincos, asin/acos, exp and log, on 32-bit Intel Ubuntu, 64-bit Mac OS X and an ARM Cortex-A9.

Linux on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz, bogomips : 3990.01

sleef.BenchmarkSincos 10000000       285 ns/op
sleef.BenchmarkMathSincos 50000000        55 ns/op
sleef.BenchmarkAsin  500000      3161 ns/op
sleef.BenchmarkMathAsin 20000000       103 ns/op
sleef.BenchmarkAcos 1000000      2017 ns/op
sleef.BenchmarkMathAcos 20000000       105 ns/op
sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op
sleef.BenchmarkLog 10000000       279 ns/op
sleef.BenchmarkMathLog 50000000        31 ns/op

Linux on ARMv7 OMAP4 Pandaboard, BogoMIPS : 2009.29

sleef.BenchmarkSincos    5000000               400 ns/op                        
sleef.BenchmarkMathSincos        5000000               443 ns/op                
sleef.BenchmarkAsin       200000             12983 ns/op                        
sleef.BenchmarkMathAsin  1000000              1672 ns/op                        
sleef.BenchmarkAcos       200000             12890 ns/op                        
sleef.BenchmarkMathAcos  1000000              1705 ns/op                        
sleef.BenchmarkExp      10000000               204 ns/op                        
sleef.BenchmarkMathExp   5000000               531 ns/op                        
sleef.BenchmarkLog       5000000               662 ns/op                        
sleef.BenchmarkMathLog   5000000               418 ns/op


Mac OS X on a 2.4GHz Intel Core 2 Duo

sleef.BenchmarkSincos    5000000               51 ns/op                        
sleef.BenchmarkMathSincos        5000000               38 ns/op                
sleef.BenchmarkAsin       200000             1122 ns/op                        
sleef.BenchmarkMathAsin  1000000              68 ns/op                        
sleef.BenchmarkAcos       200000             1107 ns/op                        
sleef.BenchmarkMathAcos  1000000              75 ns/op                        
sleef.BenchmarkExp      10000000               51 ns/op                        
sleef.BenchmarkMathExp   5000000               29 ns/op                        
sleef.BenchmarkLog       5000000               149 ns/op                        
sleef.BenchmarkMathLog   5000000               26 ns/op

Jun 2, 2011

Easy Going with ARM, square root up [3]

In my last post, we noticed that gc is still nearly twice as slow as gcc on the nbody benchmark. So, why? Let's put the simplest possible program under test.

go@localhost:~/go/ex$ cat f.c
#include <math.h>

double mysqrt(double a) { return sqrt(a); }

int main() { mysqrt(0.1); return 0; }

go@localhost:~/go/ex$ cat f.go
package main

import "math"

func mysqrt(a float64) float64 { return math.Sqrt(a) }

func main() { mysqrt(0.1) }

Now compare gcc's disassembly

000083e0 <mysqrt>:
push    {r7, lr}
sub     sp, #8
add     r7, sp, #0
strd    r0, r1, [r7]
vldr    d6, [r7]
vsqrt.f64       d7, d6
vcmp.f64        d7, d7
vmrs    APSR_nzcv, fpscr
beq.n   8408 <mysqrt+0x28>
vmov    r0, r1, d6
blx     8354 <_init+0x48>
vmov    d7, r0, r1
vmov    r2, r3, d7
mov     r0, r2
mov     r1, r3
add.w   r7, r7, #8
mov     sp, r7
pop     {r7, pc}

00008418 <main>:
push    {r7, lr}
add     r7, sp, #0
add     r1, pc, #16 
ldrd    r0, r1, [r1]
bl      83e0 <mysqrt>
mov     r0, r3
pop     {r7, pc}

to gc's:

00010c00 <main.mysqrt>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #16
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
vldr    d0, [sp, #24]
vstr    d0, [sp, #4]
bl      280e4 <math.sqrt>
vldr    d0, [sp, #12]
vstr    d0, [sp, #32]
ldr     pc, [sp], #20

00010c34 <main.main>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #0
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
ldr     fp, [pc, #16] 
vldr    d0, [fp]
vstr    d0, [sp, #4]
bl      10c00 <main.mysqrt>
ldr     pc, [sp], #20
b       10c64 <main.main+0x30>

Russ is correct. Gcc uses an intrinsic to emit the VSQRT instruction directly into the code stream, while gc uses a function call, which is expensive. Go will have function inlining, though it is not available yet. But look at the A9 NEON MPE reference manual: VSQRT takes 32 cycles (VDIV takes 25; most others, including VMUL, take just 1 or 2), so the saving from function inlining may not help much.

Of course, hardware SQRT alone already speeds up the nbody benchmark by 7 times. ARM VFP also has ABS and NEG instructions. They look trivial at first glance, because all they do is clear, or flip, the MSB (the sign bit) of the operand register. Couldn't that be done with a simple AND or XOR (sketched in Go below)? Yes. But we also need to know:
  1. The MPE has separate VFP and NEON (SIMD) blocks that share the same register file of 32 sixty-four-bit registers.
  2. VFP has no logical operations.
  3. NEON does, but switching between VFP and NEON on the A9 is expensive.
  4. Moving values back and forth between VFP and ARM core registers stalls both pipelines, and should not be done in a tight loop.
Reasons No. 2 and 4 explain the rationale for VABS and VNEG, though not a strong one. I am happy to live without them.
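
In Go source, the same bit tricks are easy to express. A minimal sketch of sign-bit Abs and Neg:

package main

import (
    "fmt"
    "math"
)

// abs clears the sign bit of the float64 bit pattern (an AND NOT).
func abs(x float64) float64 {
    return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

// neg flips the sign bit (an XOR).
func neg(x float64) float64 {
    return math.Float64frombits(math.Float64bits(x) ^ (1 << 63))
}

func main() {
    fmt.Println(abs(-1.5), neg(2.5)) // 1.5 -2.5
}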

Reason No. 3 is critical. An FPU is effectively mandatory for C and Go compilers (software emulation exists, in both the compilers and the Linux kernel, to help low-end chips without a hardware floating-point unit), but SIMD lives in the land of handcrafted code: it speeds up certain special operations, which dedicated hardware logic tends to do a much better job of anyway, often rendering SIMD a white elephant in the end. For example, NVidia is frequently challenged on why it did not implement NEON in its cutting-edge Tegra 2; the usual answer is that with dedicated audio/video codecs and a 2D/3D GPU, they see no immediate need for NEON. Most other chips, including NVidia's next-generation Tegra, have NEON, for programmers to waste time on.

Back to package math. I would like to port sincos_amd64.s to ARM VFP. That function can serve as the basis for other trigonometric functions. The algorithm, SLEEF, is intended for SIMD, but this implementation is not; it is a slightly optimized translation from C, so it would be straightforward to make a Go copy. But, as its comments note, a VCMP would save a branch, always a win on a modern, deeply pipelined CPU.

What does that mean? Let's dive into assembly again. A Go program like this:

package main

func main() {
    a, b := 1.0, 2.0
    if a > b {
        b = a
    }
}

compiles with 5g -S to this:

--- prog list "main" ---
0000 (f.go:3) TEXT   main+0(SB),R0,$32-0
0001 (f.go:4) MOVD   $(1.00000000000000000e+00),F4
0002 (f.go:4) MOVD   $(2.00000000000000000e+00),F3
0003 (f.go:5) B       ,5(APC)
0004 (f.go:5) B       ,16(APC)
0005 (f.go:5) B       ,7(APC)
0006 (f.go:5) B       ,14(APC)
0007 (f.go:5) MOVD   F4,F0
0008 (f.go:5) MOVD   F4,F2
0009 (f.go:5) MOVD   F3,F1
0010 (f.go:5) CMPD   F3,F4,
0011 (f.go:5) BVS     ,13(APC)
0012 (f.go:5) BGT     ,6(APC)
0013 (f.go:5) B       ,4(APC)
0014 (f.go:6) MOVD   F2,F0
0015 (f.go:5) B       ,16(APC)
0016 (f.go:8) RET     ,

Lots of B instructions for branches. Modern CPUs spend a lot of silicon on predicting branches, to squeeze out the pipeline bubbles caused by misprediction. But if we work in assembly, the code can be optimized considerably with conditional instructions. That could be a big deal for something like math.Sincos, because it is often used in tight loops for thousands of calculations.

Looking forward to another commit of source-icide soon.