Jun 2, 2011

Easy Going with ARM, square root up [3]

In my last post, we noticed that gc is still nearly twice as slow as the gcc for nbody benchmark. So, why? We put a simplest program under test.

go@localhost:~/go/ex$ cat f.c
double mysqrt(double a){ return sqrt(a);}
main(){ mysqrt(0.1);}

go@localhost:~/go/ex$ cat f.go
package main
import "math"
func mysqrt(a float64) float64 {return math.Sqrt(a)}
func main(){ mysqrt(0.1)}

Now compare the gcc's disassembly

000083e0 <mysqrt>:
push    {r7, lr}
sub     sp, #8
add     r7, sp, #0
strd    r0, r1, [r7]
vldr    d6, [r7]
vsqrt.f64       d7, d6
vcmp.f64        d7, d7
vmrs    APSR_nzcv, fpscr
beq.n   8408 <mysqrt+0x28>
vmov    r0, r1, d6
blx     8354 <_init+0x48>
vmov    d7, r0, r1
vmov    r2, r3, d7
mov     r0, r2
mov     r1, r3
add.w   r7, r7, #8
mov     sp, r7
pop     {r7, pc}

00008418 <main>:
push    {r7, lr}
add     r7, sp, #0
add     r1, pc, #16 
ldrd    r0, r1, [r1]
bl      83e0 <mysqrt>
mov     r0, r3
pop     {r7, pc}
to gc's:
00010c00 <main.mysqrt>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #16
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
vldr    d0, [sp, #24]
vstr    d0, [sp, #4]
bl      280e4 <math.sqrt>
vldr    d0, [sp, #12]
vstr    d0, [sp, #32]
ldr     pc, [sp], #20

00010c34 <main.main>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #0
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
ldr     fp, [pc, #16] 
vldr    d0, [fp]
vstr    d0, [sp, #4]
bl      10c00 <main.mysqrt>
ldr     pc, [sp], #20
b       10c64 <main.main+0x30>
Russ is correct. Gcc uses intrinsics to generate VSQRT instruction in the code stream directly, while Gc uses a function call, which it seems expensive. Go will have function inlining, tough not available now. But look at A9 Neon MPE reference manual, VSQRTD uses 32 cycles (VDIV is 25, most others including VMUL are just 1 or 2 cycles), so the saving from function inlining may not help much.

Of course, SQRT in hardware already speeds up nbody benchmark by 7 times.  ARM VFP also has ABS and NEG instruction. They are trivial at first glance, because what they do are simply clear, or negate the MSB (sign bit) of the operand register. Cannot this be done by simple AND or XOR? Yes. But we also need to know:
  1. MPE has separate VFP and NEON (SIMD) block share the same 32 sixty-four bit register file. 
  2. VFP has no logic operation.
  3. NEON has, but switching between VFP and NEON in A9, is expensive.
  4. Move back and forth between VFP and ARM core registers stalls both pipelines, should not be used in tight loop. 
Reason No. 2 and 4 explains the reasons for VABS and VNEG, but not a strong one. I am happy to live without them.

Reason No. 3 is critical. FPU normally is mandatory for C and Go compilers (software emulation in both compiler and Linux kernel exists to help low end chips that without hardware floating point unit), but SIMD, is in the land of handcrafts, is used to speed up certain special operation, which in turn the special hardware logic tends to do much better job, thus renders SIMD a white elephant in the end. For example, NVidia is frequently challenged why not implement NEON in its cutting edge Tegra 2, the reason quoted usually is, with dedicate Audio/video codec and 2D/3D GPU, they don't see an immediate needs for NEON. Most other chips, including NVidia's next generation Tegra, have NEON, for programmers to waste time on. 

Back to package math. I would like to port sincos_amd64.s to ARM VFP. That function can be used as the basis for other trigonometry functions. The algorithm, SLEEF, is intended for SIMD, but this implementation is not. It is a slightly optimized translation from C. So it would be straightforward to have a Go copy. But, as said in its comment, a VCMP would save a branch, always a win in modern deep pipelined CPU.

What does that mean? Dive into assembly again. A Go like this:

package main

func main(){
a, b := 1.0, 2.0
if a > b {
b = a

would 5g -S to this:

--- prog list "main" ---
0000 (f.go:3) TEXT   main+0(SB),R0,$32-0
0001 (f.go:4) MOVD   $(1.00000000000000000e+00),F4
0002 (f.go:4) MOVD   $(2.00000000000000000e+00),F3
0003 (f.go:5) B       ,5(APC)
0004 (f.go:5) B       ,16(APC)
0005 (f.go:5) B       ,7(APC)
0006 (f.go:5) B       ,14(APC)
0007 (f.go:5) MOVD   F4,F0
0008 (f.go:5) MOVD   F4,F2
0009 (f.go:5) MOVD   F3,F1
0010 (f.go:5) CMPD   F3,F4,
0011 (f.go:5) BVS     ,13(APC)
0012 (f.go:5) BGT     ,6(APC)
0013 (f.go:5) B       ,4(APC)
0014 (f.go:6) MOVD   F2,F0
0015 (f.go:5) B       ,16(APC)
0016 (f.go:8) RET     ,

Lot of B for branches. Modern CPU puts lot of silicon to predict the branches, to squeeze out the pipeline bubbles caused by mis-prediction. But if we work on assembly, the code could be much optimized with conditional instructions. That could be a big deal for things like math.Sincos, because it is often used in tight loops for thousands of calculations.

Looking forward to another commit source-icide soon.