go@localhost:~/go/ex$ cat f.c double mysqrt(double a){ return sqrt(a);} main(){ mysqrt(0.1);} go@localhost:~/go/ex$ cat f.go package main import "math" func mysqrt(a float64) float64 {return math.Sqrt(a)} func main(){ mysqrt(0.1)}
Now compare the gcc's disassembly
000083e0 <mysqrt>:
push {r7, lr}
sub sp, #8
add r7, sp, #0
strd r0, r1, [r7]
vldr d6, [r7]
vsqrt.f64 d7, d6
vcmp.f64 d7, d7
vmrs APSR_nzcv, fpscr
beq.n 8408 <mysqrt+0x28>
vmov r0, r1, d6
blx 8354 <_init+0x48>
vmov d7, r0, r1
vmov r2, r3, d7
mov r0, r2
mov r1, r3
add.w r7, r7, #8
mov sp, r7
pop {r7, pc}
00008418 <main>:
push {r7, lr}
add r7, sp, #0
add r1, pc, #16
ldrd r0, r1, [r1]
bl 83e0 <mysqrt>
mov r0, r3
pop {r7, pc}
to gc's:00010c00 <main.mysqrt>:
ldr r1, [sl]
cmp sp, r1
movcc r1, #180 ; 0xb4
movcc r2, #16
movcc r3, lr
blcc 14398 <runtime.morestack>
str lr, [sp, #-20]!
vldr d0, [sp, #24]
vstr d0, [sp, #4]
bl 280e4 <math.sqrt>
vldr d0, [sp, #12]
vstr d0, [sp, #32]
ldr pc, [sp], #20
00010c34 <main.main>:
ldr r1, [sl]
cmp sp, r1
movcc r1, #180 ; 0xb4
movcc r2, #0
movcc r3, lr
blcc 14398 <runtime.morestack>
str lr, [sp, #-20]!
ldr fp, [pc, #16]
vldr d0, [fp]
vstr d0, [sp, #4]
bl 10c00 <main.mysqrt>
ldr pc, [sp], #20
b 10c64 <main.main+0x30>
Russ is correct. Gcc uses intrinsics to generate VSQRT instruction in the code stream directly, while Gc uses a function call, which it seems expensive. Go will have function inlining, tough not available now. But look at A9 Neon MPE reference manual, VSQRTD uses 32 cycles (VDIV is 25, most others including VMUL are just 1 or 2 cycles), so the saving from function inlining may not help much.Of course, SQRT in hardware already speeds up nbody benchmark by 7 times. ARM VFP also has ABS and NEG instruction. They are trivial at first glance, because what they do are simply clear, or negate the MSB (sign bit) of the operand register. Cannot this be done by simple AND or XOR? Yes. But we also need to know:
- MPE has separate VFP and NEON (SIMD) block share the same 32 sixty-four bit register file.
- VFP has no logic operation.
- NEON has, but switching between VFP and NEON in A9, is expensive.
- Move back and forth between VFP and ARM core registers stalls both pipelines, should not be used in tight loop.
Reason No. 2 and 4 explains the reasons for VABS and VNEG, but not a strong one. I am happy to live without them.
Reason No. 3 is critical. FPU normally is mandatory for C and Go compilers (software emulation in both compiler and Linux kernel exists to help low end chips that without hardware floating point unit), but SIMD, is in the land of handcrafts, is used to speed up certain special operation, which in turn the special hardware logic tends to do much better job, thus renders SIMD a white elephant in the end. For example, NVidia is frequently challenged why not implement NEON in its cutting edge Tegra 2, the reason quoted usually is, with dedicate Audio/video codec and 2D/3D GPU, they don't see an immediate needs for NEON. Most other chips, including NVidia's next generation Tegra, have NEON, for programmers to waste time on.
Back to package math. I would like to port sincos_amd64.s to ARM VFP. That function can be used as the basis for other trigonometry functions. The algorithm, SLEEF, is intended for SIMD, but this implementation is not. It is a slightly optimized translation from C. So it would be straightforward to have a Go copy. But, as said in its comment, a VCMP would save a branch, always a win in modern deep pipelined CPU.
What does that mean? Dive into assembly again. A Go like this:
package main
func main(){
a, b := 1.0, 2.0
if a > b {
b = a
}
}
What does that mean? Dive into assembly again. A Go like this:
package main
func main(){
a, b := 1.0, 2.0
if a > b {
b = a
}
}
would 5g -S to this:
--- prog list "main" ---
0000 (f.go:3) TEXT main+0(SB),R0,$32-0
0001 (f.go:4) MOVD $(1.00000000000000000e+00),F4
0002 (f.go:4) MOVD $(2.00000000000000000e+00),F3
0003 (f.go:5) B ,5(APC)
0004 (f.go:5) B ,16(APC)
0005 (f.go:5) B ,7(APC)
0006 (f.go:5) B ,14(APC)
0007 (f.go:5) MOVD F4,F0
0008 (f.go:5) MOVD F4,F2
0009 (f.go:5) MOVD F3,F1
0010 (f.go:5) CMPD F3,F4,
0011 (f.go:5) BVS ,13(APC)
0012 (f.go:5) BGT ,6(APC)
0013 (f.go:5) B ,4(APC)
0014 (f.go:6) MOVD F2,F0
0015 (f.go:5) B ,16(APC)
0016 (f.go:8) RET ,
Lot of B for branches. Modern CPU puts lot of silicon to predict the branches, to squeeze out the pipeline bubbles caused by mis-prediction. But if we work on assembly, the code could be much optimized with conditional instructions. That could be a big deal for things like math.Sincos, because it is often used in tight loops for thousands of calculations.
Looking forward to another commit source-icide soon.
Looking forward to another commit source-icide soon.