Aug 24, 2011

Almost renewed vim

I have been using Acme for many years and had almost forgotten how complex
program editors can be. But for the last several months I have been
digging into a headless embedded system, where only Vim is reachable
over the serial line (and ssh). I used to be a man full of Vim, but
those days are long gone. Now I am a simple man, contented with
simple things. So I cannot believe I spent a whole week reading
'A Byte of Vim' while forcing myself to use it for all my coding on
Windows/Linux/Mac, only because Plan9port stopped working after the
upgrade to Mac OS X Lion. No. It was a mistake learnt the hard way.
Vim has left me, and no force can bring it back. Give up. I'm
a happy Acmer again, even if that means running an Ubuntu VirtualBox
on top of Lion and installing Plan9port there to get my dear Acme back.
Yes, Ubuntu is configured to log straight into the 'recovery console',
where .profile runs rio & acme & google-chrome &, then xshoves the
latter two to full screen; I alt-tab between them as I always did.

Jun 9, 2011

SLEEF, an excursion on Go's math

A benchmark led to a contribution and the discovery of sincos_amd64.s, which in turn led to a pleasant sunny Sunday afternoon reading the SLEEF paper. Thank you, Mr. Naoki Shibata, for your kind email.

SLEEF has 128-bit SSE-optimized code to obtain sin and cos at the same time, but sincos_amd64.s only uses the 64-bit FPU; in effect it is just a slightly optimized version of the SLEEF reference code in C.

I was not convinced of the benefit of turning optimized C into handcrafted assembly merely to streamline the branches. Go should not be slow either. We just need the numbers to speak for themselves.

A straightforward translation and benchmark showed a 4x difference on my Mac Mini.

sleef.BenchmarkSincos  10000000        169 ns/op
sleef.BenchmarkMathSincos 50000000         39 ns/op

Turning math.Pow and divisions into multiplications, and inlining the helper functions, might speed it up. Let's try again:

sleef.BenchmarkSincos  50000000        51 ns/op
sleef.BenchmarkMathSincos 50000000         38 ns/op

The benefit is not convincing enough to justify an architecture-dependent optimization. The compiler does a fairly good job already, and most importantly, its output is trustworthy and portable. An optimization should always be justified by a benchmark.

How does it perform on ARM?

Misfortune struck. Just then I unplugged the LAN cable from the pandaboard to use it on the Mac. When I plugged it back in, the board kept telling me the files were read-only. Rebooting without a proper shutdown (impossible, as the file system was read-only) left a corrupted EXT3-fs on my mmcblk0p2. All my math.Sqrt work was there, gone, luckily already pushed up to the cloud.

DON'T PANIC. Let's try e2fsck /dev/mmcblk0p2 first. Wow, after some clearing and fixing, the SD card is back in business.

The baseline, with Pow and div:

sleef.BenchmarkSincos    1000000              2490 ns/op
sleef.BenchmarkMathSincos        5000000               442 ns/op

The optimized version is much faster, comparable to the existing sin.go algorithm:

sleef.BenchmarkSincos    5000000               400 ns/op
sleef.BenchmarkMathSincos        5000000               443 ns/op

Worth converting to SLEEF? It depends on another key factor: accuracy. That is the whole point of SLEEF: minimizing cancellation error to improve accuracy, as demonstrated in the paper.

That accuracy holds for sincos, but not for exp: I had to relax the tolerance from 1e-14 to 1e-6 to pass gotest. exp.go is also slower than exp_386.s on my Linux 8g:

sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op

But faster than expGo on my ARM 5g:

sleef.BenchmarkExp      10000000               204 ns/op  
sleef.BenchmarkMathExp   5000000               525 ns/op 

It could be faster still, as 5g does not yet do a good job of constant folding, as demonstrated by NOT using T1 to T4 but calculating them in place instead:

sleef.BenchmarkSincos    1000000              1565 ns/op 
sleef.BenchmarkMathSincos        5000000               442 ns/op  

Enough benchmarking. My head is free-wheeling already. Go get a shot of espresso and rest in peas now!

OK. A last word to the world: the full set of benchmarks for SLEEF sincos, asin/acos, exp and log, on 32-bit Intel Ubuntu, 64-bit Mac OS X and ARM Cortex-A9.

Linux on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz, bogomips : 3990.01

sleef.BenchmarkSincos 10000000       285 ns/op
sleef.BenchmarkMathSincos 50000000        55 ns/op
sleef.BenchmarkAsin  500000      3161 ns/op
sleef.BenchmarkMathAsin 20000000       103 ns/op
sleef.BenchmarkAcos 1000000      2017 ns/op
sleef.BenchmarkMathAcos 20000000       105 ns/op
sleef.BenchmarkExp 20000000       117 ns/op
sleef.BenchmarkMathExp 50000000        69 ns/op
sleef.BenchmarkLog 10000000       279 ns/op
sleef.BenchmarkMathLog 50000000        31 ns/op

Linux on ARMv7 OMAP4 Pandaboard, BogoMIPS: 2009.29

sleef.BenchmarkSincos    5000000               400 ns/op                        
sleef.BenchmarkMathSincos        5000000               443 ns/op                
sleef.BenchmarkAsin       200000             12983 ns/op                        
sleef.BenchmarkMathAsin  1000000              1672 ns/op                        
sleef.BenchmarkAcos       200000             12890 ns/op                        
sleef.BenchmarkMathAcos  1000000              1705 ns/op                        
sleef.BenchmarkExp      10000000               204 ns/op                        
sleef.BenchmarkMathExp   5000000               531 ns/op                        
sleef.BenchmarkLog       5000000               662 ns/op                        
sleef.BenchmarkMathLog   5000000               418 ns/op

Mac OS X on 2.4GHz Intel Core 2 Duo

sleef.BenchmarkSincos    5000000               51 ns/op                        
sleef.BenchmarkMathSincos        5000000               38 ns/op                
sleef.BenchmarkAsin       200000             1122 ns/op                        
sleef.BenchmarkMathAsin  1000000              68 ns/op                        
sleef.BenchmarkAcos       200000             1107 ns/op                        
sleef.BenchmarkMathAcos  1000000              75 ns/op                        
sleef.BenchmarkExp      10000000               51 ns/op                        
sleef.BenchmarkMathExp   5000000               29 ns/op                        
sleef.BenchmarkLog       5000000               149 ns/op                        
sleef.BenchmarkMathLog   5000000               26 ns/op

Jun 2, 2011

Easy Going with ARM, square root up [3]

In my last post, we noticed that gc is still nearly twice as slow as gcc on the nbody benchmark. So, why? Let's put the simplest program under test.

go@localhost:~/go/ex$ cat f.c
#include <math.h>
double mysqrt(double a){ return sqrt(a); }
int main(){ mysqrt(0.1); }

go@localhost:~/go/ex$ cat f.go
package main
import "math"
func mysqrt(a float64) float64 {return math.Sqrt(a)}
func main(){ mysqrt(0.1)}

Now compare gcc's disassembly

000083e0 <mysqrt>:
push    {r7, lr}
sub     sp, #8
add     r7, sp, #0
strd    r0, r1, [r7]
vldr    d6, [r7]
vsqrt.f64       d7, d6
vcmp.f64        d7, d7
vmrs    APSR_nzcv, fpscr
beq.n   8408 <mysqrt+0x28>
vmov    r0, r1, d6
blx     8354 <_init+0x48>
vmov    d7, r0, r1
vmov    r2, r3, d7
mov     r0, r2
mov     r1, r3
add.w   r7, r7, #8
mov     sp, r7
pop     {r7, pc}

00008418 <main>:
push    {r7, lr}
add     r7, sp, #0
add     r1, pc, #16 
ldrd    r0, r1, [r1]
bl      83e0 <mysqrt>
mov     r0, r3
pop     {r7, pc}

to gc's:

00010c00 <main.mysqrt>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #16
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
vldr    d0, [sp, #24]
vstr    d0, [sp, #4]
bl      280e4 <math.sqrt>
vldr    d0, [sp, #12]
vstr    d0, [sp, #32]
ldr     pc, [sp], #20

00010c34 <main.main>:
ldr     r1, [sl]
cmp     sp, r1
movcc   r1, #180        ; 0xb4
movcc   r2, #0
movcc   r3, lr
blcc    14398 <runtime.morestack>
str     lr, [sp, #-20]!
ldr     fp, [pc, #16] 
vldr    d0, [fp]
vstr    d0, [sp, #4]
bl      10c00 <main.mysqrt>
ldr     pc, [sp], #20
b       10c64 <main.main+0x30>

Russ is correct. gcc uses intrinsics to generate the VSQRT instruction directly in the code stream, while gc uses a function call, which seems expensive. Go will have function inlining, though it is not available now. But looking at the A9 NEON MPE reference manual, VSQRT takes 32 cycles (VDIV takes 25; most others, including VMUL, are just 1 or 2 cycles), so the saving from function inlining may not help much.

Of course, SQRT in hardware already speeds up the nbody benchmark by 7 times. ARM VFP also has ABS and NEG instructions. They look trivial at first glance, because all they do is clear, or flip, the MSB (the sign bit) of the operand register. Couldn't that be done with a simple AND or XOR? Yes. But we also need to know:
  1. The MPE has separate VFP and NEON (SIMD) blocks sharing the same register file of 32 sixty-four-bit registers.
  2. VFP has no logic operations.
  3. NEON has them, but switching between VFP and NEON on the A9 is expensive.
  4. Moving back and forth between VFP and ARM core registers stalls both pipelines, so it should not be done in a tight loop.

Reasons 2 and 4 explain the existence of VABS and VNEG, but not strongly. I am happy to live without them.

Reason 3 is critical. An FPU is normally taken for granted by C and Go compilers (software emulation exists in both the compilers and the Linux kernel to help low-end chips without a hardware floating point unit), but SIMD lives in the land of handcrafted code: it is used to speed up certain special operations, which dedicated hardware logic tends to do a much better job at anyway, rendering SIMD a white elephant in the end. For example, NVidia is frequently challenged on why it did not implement NEON in its cutting-edge Tegra 2; the reason usually quoted is that with dedicated audio/video codecs and a 2D/3D GPU, they do not see an immediate need for NEON. Most other chips, including NVidia's next-generation Tegra, have NEON, for programmers to waste time on.

Back to package math. I would like to port sincos_amd64.s to ARM VFP. That function can serve as the basis for other trigonometric functions. The algorithm, SLEEF, is intended for SIMD, but this implementation is not; it is a slightly optimized translation from C, so it would be straightforward to make a Go copy. But, as its comments note, a VCMP would save a branch, always a win on a modern deeply pipelined CPU.

What does that mean? Dive into assembly again. A Go like this:

package main

func main() {
	a, b := 1.0, 2.0
	if a > b {
		b = a
	}
}

which 5g -S compiles to this:

--- prog list "main" ---
0000 (f.go:3) TEXT   main+0(SB),R0,$32-0
0001 (f.go:4) MOVD   $(1.00000000000000000e+00),F4
0002 (f.go:4) MOVD   $(2.00000000000000000e+00),F3
0003 (f.go:5) B       ,5(APC)
0004 (f.go:5) B       ,16(APC)
0005 (f.go:5) B       ,7(APC)
0006 (f.go:5) B       ,14(APC)
0007 (f.go:5) MOVD   F4,F0
0008 (f.go:5) MOVD   F4,F2
0009 (f.go:5) MOVD   F3,F1
0010 (f.go:5) CMPD   F3,F4,
0011 (f.go:5) BVS     ,13(APC)
0012 (f.go:5) BGT     ,6(APC)
0013 (f.go:5) B       ,4(APC)
0014 (f.go:6) MOVD   F2,F0
0015 (f.go:5) B       ,16(APC)
0016 (f.go:8) RET     ,

A lot of B instructions, for branches. Modern CPUs put a lot of silicon into predicting branches, to squeeze out the pipeline bubbles caused by misprediction. But if we work in assembly, the code can be optimized considerably with conditional instructions. That could be a big deal for things like math.Sincos, which is often called in tight loops for thousands of calculations.

Looking forward to another commit source-icide soon.

May 31, 2011

Easy Going with ARM, Kung Fu Panda [2]

One of the fastest ARM boards you can get today is the $179 Pandaboard from Digikey. Do prepare to wait months for it, though.

I chose to install the headless image of Ubuntu 11.04. Because my laptop has an SD slot at /dev/mmcblk0, the installation was as smooth as my Teflon pan.

  1. insert SD card to Laptop
  2. sudo umount /dev/mmcblk0
  3. sudo sh -c 'zcat ubuntu-11.04-preinstalled-headless-armel+omap4.img.gz > /dev/mmcblk0'
  4. sync
  5. insert SD card to pandaboard, plug power, LAN and USB-Serial cable in
  6. On the laptop terminal: TERM=vt100 minicom
  7. turn on the pandaboard; in minicom, after U-Boot comes the standard Ubuntu installation process, which finally gives us a shell prompt.

The Go installation is the same as in my last post, only much faster. Now comes the benchmark: how does a 1GHz dual-core ARM A9 compare to my 2GHz dual-core Intel x86, and to a 1GHz single-core ARM A8?

Not surprisingly, the number of cores does not count here, since nothing parallel is benchmarked. For string processing the 1GHz A9 is slightly faster than the A8, but still less than half the speed of the 2GHz x86 core. The A9's VFP has been greatly improved: 5x faster than the A8 under gcc.

But surprisingly, and I am astonished to see it, for floating point crunching on the A9, gc is 11x slower than optimized gcc. This is very unfortunate, because my interest in Go on ARM is OpenGL ES, which is all about floating point matrix operations.

[update] The 11x slowdown is caused by the unoptimized pkg/math/sqrt.go; since ARM VFP has a VSQRT instruction, it should not be hard to speed it up.

[update 2] I made it. Now it is 7x faster:

nbody -n 50000000
        gcc -O2 -lm nbody.c     71.40u 0.00s 71.43r
        gc nbody        120.93u 0.00s 120.94r
        gc_B nbody      119.78u 0.00s 119.80r

go@localhost:~/go/test/bench$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 2 (v7l)
processor   : 0
BogoMIPS  : 2009.29
processor   : 1        
BogoMIPS  : 1963.08         
Features  : swp half thumb fastmult vfp edsp thumbee neon vfpv3        
Hardware : OMAP4 Panda board 

go@localhost:~/go/test/bench$ gomake timing  
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c    7.88u 1.55s 9.54r
gc reverse-complement   18.36u 1.91s 20.29r
gc_B reverse-complement 17.75u 2.08s 19.85r
nbody -n 50000000
gcc -O2 -lm nbody.c     71.40u 0.00s 71.41r
gc nbody        862.53u 0.02s 862.78r
gc_B nbody      865.00u 0.05s 865.28r   

(OMAP3, OMAP4, IMX53 and the earth creator)

May 30, 2011

Easy Going with ARM

How easy would it be to Go with ARM? It depends, from super difficult to duper easy. This is the latter case. Just got an iMX53 board from Mouser for US$149 with free FedEx to Singapore. Opened the box, plugged the cables in, inserted the SD card, and pressed the power switch.

As advertised, I should have seen an instant boot-up. But nothing happened. Dead on arrival? Scratched my head. Where's the manual? Searched the box again. Not found. Looked at the board and thought: should I insert the SD or the microSD? No harm in trying.

Slid the microSD out of the SD sleeve, put it in that slot, powered up. Bingo. And as I lay back idly in my armchair and glanced at the box again... there you are. The manual and CD were right there, pasted to the back of the box cover!

From then on, it could not have been easier: just follow the normal procedure to get Go on the pre-installed Ubuntu Lucid on the IMX53 Starter-Kit.
  1. sudo apt-get update
  2. sudo apt-get install mercurial bison ed gawk gcc libc6-dev make
  3. hg clone -u release go
  4. cd go/src; ./make.bash

It just takes much longer (an hour?), because it runs on a 1GHz ARM Cortex-A8 with 1GB DDR3 RAM, and worst of all, the SD card is way too slow. The great news is that this pixie has a SATA port. Next time I will try to find a spare hard disk and see how speedy it can go.

How slow is it? Here are the go/test/bench results on my PC and on ARM. I picked just two benchmarks to show string processing and floating point crunching, and cpuinfo is shortened to the relevant lines only.

#! cat /proc/cpuinfo
model name : Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
stepping : 11
cpu MHz : 800.000
cache size : 4096 KB
bogomips : 3990.32

#! make timing
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c 1.43u 0.23s 1.67r
gc reverse-complement 2.72u 0.26s 3.00r
gc_B reverse-complement 2.64u 0.31s 2.95r

nbody -n 50000000
gcc -O2 -lm nbody.c 27.78u 0.00s 27.92r
gc nbody 54.01u 0.00s 54.06r
gc_B nbody 52.18u 0.00s 52.27r

lucid@lucid-desktop:~$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 999.42
Features : swp half thumb fastmult vfp edsp neon vfpv3 
CPU implementer : 0x41
CPU architecture: 7
Hardware : Freescale MX53 LOCO Board

lucid@lucid-desktop:~/go/test/bench$ gomake timing
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c 8.00u 1.14s 10.27r
gc reverse-complement 22.97u 1.26s 27.62r
gc_B reverse-complement 22.09u 1.52s 30.77r

nbody -n 50000000
gcc -O2 -lm nbody.c 316.08u 0.40s 389.46r
gc nbody  645.32u 686.06s 1843.30r
gc_B nbody 653.49u 640.93s 1373.40r

Not surprisingly, the Cortex-A8 VFP is known to be very slow, due to its non-pipelined architecture. But I didn't know it was this tortoise-like. Will the Cortex-A9 be much better? Once I get my OMAP4430 Pandaboard working, I will report it here.

Just to confirm I was using VFP and not soft-float, I built each version and compared:

$    5l -F -o nbody.arm5 nbody.5
$    5l -o nbody.arm6 nbody.5
$    time ./nbody.arm6 -n 50000
real 0m1.316s
user 0m1.310s
sys 0m0.000s

$    time ./nbody.arm5 -n 50000
real 0m30.788s
user 0m29.830s
sys 0m0.000s

By the way, the proper way to build against glib is via pkg-config, as below. But I gave up trying to fix run().

run 'gcc -O2 `pkg-config --cflags glib-2.0` k-nucleotide.c `pkg-config --libs glib-2.0`' a.out <x


(the mouse is for illustration only; it does not come with the board)

May 18, 2011

Black Perl

I stumbled upon a poem written in the Perl language, said to compile under Perl 3. The author is Larry Wall, linguist, inventor and mentor of Perl. The curious thing is that the Perl I am familiar with looks a lot like C, with every variable carrying money (a dollar sign), yet this poem, apart from die, shows hardly any Perl trait at all. No wonder people say of Perl, "There's more than one way to do it". Indeed so.

BEFOREHAND: close door, each window & exit; wait until time.
    open spellbook, study, read (scan, select, tell us);
write it, print the hex while each watches,
    reverse its length, write again;
    kill spiders, pop them, chop, split, kill them.
        unlink arms, shift, wait & listen (listening, wait),
sort the flock (then, warn the "goats" & kill the "sheep");
    kill them, dump qualms, shift moralities,
    values aside, each one;
        die sheep! die to reverse the system
        you accept (reject, respect);
next step,
    kill the next sacrifice, each sacrifice,
    wait, redo ritual until "all the spirits are pleased";
    do it ("as they say").
do it(*everyone***must***participate***in***forbidden**s*e*x*).
return last victim; package body;
    exit crypt (time, times & "half a time") & close it,
    select (quickly) & warn your next victim;

AFTERWORDS: tell nobody.
    wait, wait until time;
    wait until next year, next decade;
        sleep, sleep, die yourself,
        die at last
# Larry Wall

Searching a bit more, someone had long since played the immortal poet of C. It compiles, it runs, and it gets more amusing the more you read!

    double time, me= !0XFACE,
    not; int rested,   get, out;
    main(ly, die) char ly, **die ;{
        signed char lotte,
dear; (char)lotte--;
    for(get= !me;; not){
    1s -  out & out ;lie;{
    char lotte, my= dear,
    **let= !!me *!not+ ++die;
"The gloves are OFF this time, I detest you, snot\n\0sed GEEK!");
    do {not= *lie++ & 0xF00L* !me;
    #define love (char*)lie -
    love 1s *!(not= atoi(let
    [get -me?
(char)lotte: my- *love -
    'I'  -  *love -  'U' -
    'I'  -  (long)  - 4 - 'U' ])- !!
    (time  =out=  'a'));} while( my - dear
    && 'I'-1l  -get-  'a'); break;}}
(char)*lie++, (char)*lie++; hell:0, (char)*lie;
    get *out* (short)ly   -0-'R'-  get- 'a'^rested;
    do {auto*eroticism,
    that; puts(*( out
        - 'c'
-('P'-'S') +die+ -2 ));}while(!"you're at it");
for (*((char*)&lotte)^=
    (char)lotte; (love ly) [(char)++lotte+
    !!0xBABE];){ if ('I' -lie[ 2 +(char)lotte]){ 'I'-1l ***die; }
    else{ if ('I' * get *out* ('I'-1l **die[ 2 ])) *((char*)&lotte) -=
    '4' - ('I'-1l); not; for(get=!
get; !out; (char)*lie  &  0xD0- !not) return!!
    do{ not* putchar(lie [out
    *!not* !!me +(char)lotte]);
    not; for(;!'a';);}while(
        love (char*)lie);{
register this; switch( (char)lie
    [(char)lotte] -1s *!out) {
    char*les, get= 0xFF, my; case' ':
    *((char*)&lotte) += 15; !not +(char)*lie*'s';
    this +1s+ not; default: 0xF +(char*)lie;}}}
    get - !out;
    if (not--)
    goto hell;
        exit( (char)lotte);}