May 31, 2011

Easy Going with ARM, Kung Fu Panda [2]

One of the fastest ARM boards available today is the $179 Pandaboard from Digikey. Be prepared to wait months for delivery, though.

I chose to install the headless image of Ubuntu 11.04. Because my laptop has an SD slot that shows up as /dev/mmcblk0, the installation process was as smooth as my Teflon pan.

  1. insert the SD card into the laptop
  2. sudo umount /dev/mmcblk0
  3. sudo sh -c 'zcat ubuntu-11.04-preinstalled-headless-armel+omap4.img.gz > /dev/mmcblk0'
  4. sync
  5. insert the SD card into the Pandaboard, and plug in power, LAN and the USB-serial cable
  6. On the laptop terminal: TERM=vt100 minicom
  7. turn on the Pandaboard; in minicom, u-boot is followed by the standard Ubuntu installation process, which finally gives us a shell prompt

The Go installation is the same as in my last post, only much faster. Now for the benchmark: how does the 1GHz dual-core ARM A9 compare to my 2GHz dual-core Intel x86 and the 1GHz single-core ARM A8?

Not surprisingly, the number of cores does not matter here, since nothing in the benchmark runs in parallel. For string processing, the 1GHz A9 is slightly faster than the A8, but still more than twice as slow as the 2GHz x86 core. The A9's VFP has been greatly improved: with gcc it is 5x faster than the A8's.

But to my astonishment, for floating-point crunching on the A9, gc is 11x slower than optimized gcc. This is very unfortunate, because what interests me about Go on ARM is OpenGL ES, which is all about matrix operations on floating-point numbers.

[update] The 11x slowdown is caused by the unoptimized pkg/math/sqrt.go. Since ARM VFP has a VSQRT instruction, it should not be hard to speed it up.

[update 2] Done. With the sqrt fix, gc nbody is now 7x faster:

nbody -n 50000000
        gcc -O2 -lm nbody.c     71.40u 0.00s 71.43r
        gc nbody        120.93u 0.00s 120.94r
        gc_B nbody      119.78u 0.00s 119.80r

go@localhost:~/go/test/bench$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 2 (v7l)
processor   : 0
BogoMIPS  : 2009.29
processor   : 1        
BogoMIPS  : 1963.08         
Features  : swp half thumb fastmult vfp edsp thumbee neon vfpv3        
Hardware : OMAP4 Panda board 

go@localhost:~/go/test/bench$ gomake timing  
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c    7.88u 1.55s 9.54r
gc reverse-complement   18.36u 1.91s 20.29r
gc_B reverse-complement 17.75u 2.08s 19.85r
nbody -n 50000000
gcc -O2 -lm nbody.c     71.40u 0.00s 71.41r
gc nbody        862.53u 0.02s 862.78r
gc_B nbody      865.00u 0.05s 865.28r   
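
The comparisons quoted in the text can be rechecked from the user-time columns of the listings in this post. Here is a minimal sketch that does the arithmetic; the function name ratio and the constant names are made up for illustration, and the numbers are copied from the listings.

```go
package main

import "fmt"

// ratio reports how many times larger slow is than fast.
func ratio(slow, fast float64) float64 {
	return slow / fast
}

func main() {
	// User times in seconds, copied from the timing listings in this post.
	const (
		gccNbody      = 71.40  // gcc -O2 -lm nbody.c
		gcNbodyBefore = 862.53 // gc nbody, before the sqrt fix
		gcNbodyAfter  = 120.93 // gc nbody, after the sqrt fix
		gccRevComp    = 7.88   // gcc -O2 reverse-complement.c
		gcRevComp     = 18.36  // gc reverse-complement
	)

	fmt.Printf("nbody, gc vs gcc before the fix: %.1fx\n", ratio(gcNbodyBefore, gccNbody))
	fmt.Printf("nbody, gc vs gcc after the fix:  %.1fx\n", ratio(gcNbodyAfter, gccNbody))
	fmt.Printf("nbody, gc speedup from the fix:  %.1fx\n", ratio(gcNbodyBefore, gcNbodyAfter))
	fmt.Printf("reverse-complement, gc vs gcc:   %.1fx\n", ratio(gcRevComp, gccRevComp))
}
```

With these figures, unpatched gc nbody comes out roughly 12x slower than gcc, the sqrt fix is worth about 7x, and patched gc is left about 1.7x slower than gcc.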

(OMAP3, OMAP4, IMX53 and the earth creator)