May 31, 2011

Easy Going with ARM, Kung Fu Panda [2]

One of the fastest ARM boards we can get today is the $179 Pandaboard from Digikey. Be prepared to wait for months, though.

I chose to install the headless image of Ubuntu 11.04. Because my laptop has an SD slot on /dev/mmcblk0, the installation process was as smooth as my Teflon pan.

  1. Insert the SD card into the laptop
  2. sudo umount /dev/mmcblk0
  3. sudo sh -c 'zcat ubuntu-11.04-preinstalled-headless-armel+omap4.img.gz > /dev/mmcblk0'
  4. sync
  5. Insert the SD card into the Pandaboard, then plug in power, LAN and the USB-serial cable
  6. On the laptop terminal: TERM=vt100 minicom
  7. Turn on the Pandaboard. In minicom, after U-Boot comes the standard Ubuntu installation process, which finally gives us a shell prompt.

The Go installation is the same as in my last post, only much faster. Now for the benchmark: how does the 1GHz dual-core ARM Cortex-A9 compare to my 2GHz dual-core Intel x86 and the 1GHz single-core ARM Cortex-A8?

Not surprisingly, the number of cores does not count here, since no parallel processing is benchmarked. For string processing the 1GHz A9 is slightly faster than the A8, but still well over twice as slow as the 2GHz x86 core (9.54r versus 1.67r for reverse-complement with gcc). The A9's VFP has been greatly improved: with gcc, nbody runs about 5x faster than on the A8.

What astonished me, though, is that for floating-point crunching on the A9, gc is 11x slower than optimized gcc. This is very unfortunate, because what interests me about Go on ARM is OpenGL ES, which is all about floating-point matrix operations.

[update] The 11x slowdown is caused by the unoptimized pkg/math/sqrt.go. Since ARM VFP has a VSQRT instruction, it should not be hard to speed it up.

[update 2] I made it. Now gc nbody is 7x faster:

nbody -n 50000000
        gcc -O2 -lm nbody.c     71.40u 0.00s 71.43r
        gc nbody        120.93u 0.00s 120.94r
        gc_B nbody      119.78u 0.00s 119.80r

go@localhost:~/go/test/bench$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 2 (v7l)
processor   : 0
BogoMIPS  : 2009.29
processor   : 1        
BogoMIPS  : 1963.08         
Features  : swp half thumb fastmult vfp edsp thumbee neon vfpv3        
Hardware : OMAP4 Panda board 

go@localhost:~/go/test/bench$ gomake timing  
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c    7.88u 1.55s 9.54r
gc reverse-complement   18.36u 1.91s 20.29r
gc_B reverse-complement 17.75u 2.08s 19.85r
nbody -n 50000000
gcc -O2 -lm nbody.c     71.40u 0.00s 71.41r
gc nbody        862.53u 0.02s 862.78r
gc_B nbody      865.00u 0.05s 865.28r   

(OMAP3, OMAP4, IMX53 and the earth creator)

May 30, 2011

Easy Going with ARM

How easy would it be to Go with ARM? It depends, from super difficult to duper easy. This is the latter case. I just got an IMX53 Starter-Kit from Mouser for US$149 with free FedEx to Singapore. Opened the box, plugged the cables in, inserted the SD card, and pressed the power switch.

As advertised, I should expect an instant boot-up. But nothing happened. Dead on arrival? Scratched my head. Where's the manual? Searched the box again. Not found. Looked at the board and thought, should I insert the SD or the microSD? No harm in trying.

Slid the microSD out of the SD sleeve, put it in that slot, powered up. Bingo. And as I idly lay back in my armchair and glanced at the box again... There you are. The manual and CD were right there, pasted on the back of the box cover!

From then on, it could not be easier to just follow the normal procedure to have a Go on the pre-installed Ubuntu Lucid on the IMX53 Starter-Kit.
  1. sudo apt-get update
  2. sudo apt-get install mercurial bison ed gawk gcc libc6-dev make
  3. hg clone -u release go
  4. cd go/src; ./make.bash
It just takes much longer (an hour?), because it only runs on a 1GHz ARM Cortex-A8 with 1GB of DDR3 RAM and, worst of all, the SD card is way too slow. The great news is that this pixie has a SATA port. Next time I will try to find a spare hard disk and see how speedy it can go.

How slow is it? Here is go/test/bench on my PC and on the ARM board. I picked just two benchmarks, one for string processing and one for floating-point crunching, and the cpuinfo is shortened to show only the relevant lines.

#! cat /proc/cpuinfo
model name : Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
stepping : 11
cpu MHz : 800.000
cache size : 4096 KB
bogomips : 3990.32

#! make timing
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c 1.43u 0.23s 1.67r
gc reverse-complement 2.72u 0.26s 3.00r
gc_B reverse-complement 2.64u 0.31s 2.95r

nbody -n 50000000
gcc -O2 -lm nbody.c 27.78u 0.00s 27.92r
gc nbody 54.01u 0.00s 54.06r
gc_B nbody 52.18u 0.00s 52.27r

lucid@lucid-desktop:~$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 999.42
Features : swp half thumb fastmult vfp edsp neon vfpv3 
CPU implementer : 0x41
CPU architecture: 7
Hardware : Freescale MX53 LOCO Board

lucid@lucid-desktop:~/go/test/bench$ gomake timing
reverse-complement < output-of-fasta-25000000
gcc -O2 reverse-complement.c 8.00u 1.14s 10.27r
gc reverse-complement 22.97u 1.26s 27.62r
gc_B reverse-complement 22.09u 1.52s 30.77r

nbody -n 50000000
gcc -O2 -lm nbody.c 316.08u 0.40s 389.46r
gc nbody  645.32u 686.06s 1843.30r
gc_B nbody 653.49u 640.93s 1373.40r

Not surprisingly, the Cortex-A8 VFP is slow; it is known for its non-pipelined architecture. But I did not know it was so tortoise-like. Will the Cortex-A9 be much better? Once I get my OMAP4430 Pandaboard working, I will report it here.

Just to confirm I was using VFP and not soft-float, I built both versions and compared them:

$    5l -F -o nbody.arm5 nbody.5
$    5l -o nbody.arm6 nbody.5
$    time ./nbody.arm6 -n 50000
real 0m1.316s
user 0m1.310s
sys 0m0.000s

$    time ./nbody.arm5 -n 50000
real 0m30.788s
user 0m29.830s
sys 0m0.000s

By the way, the proper way to compile against glib is via pkg-config, as below. But I gave up trying to fix run().

run 'gcc -O2 `pkg-config --cflags glib-2.0` k-nucleotide.c `pkg-config --libs glib-2.0`' a.out <x


(the mouse is for illustration only; it does not come with the board)

May 18, 2011

Black Perl

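I happened upon a poem written in Perl, said to compile under Perl 3. The author is Larry Wall, linguist, inventor and mentor of Perl. The amazing thing is that the Perl I know looks a lot like C, with every variable carrying money (a $ in front), yet this poem, apart from die, shows no trace of Perl at all. No wonder people sum up Perl as "There's more than one way to do it".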

BEFOREHAND: close door, each window & exit; wait until time.
    open spellbook, study, read (scan, select, tell us);
write it, print the hex while each watches,
    reverse its length, write again;
    kill spiders, pop them, chop, split, kill them.
        unlink arms, shift, wait & listen (listening, wait),
sort the flock (then, warn the "goats" & kill the "sheep");
    kill them, dump qualms, shift moralities,
    values aside, each one;
        die sheep! die to reverse the system
        you accept (reject, respect);
next step,
    kill the next sacrifice, each sacrifice,
    wait, redo ritual until "all the spirits are pleased";
    do it ("as they say").
do it(*everyone***must***participate***in***forbidden**s*e*x*).
return last victim; package body;
    exit crypt (time, times & "half a time") & close it,
    select (quickly) & warn your next victim;

AFTERWORDS: tell nobody.
    wait, wait until time;
    wait until next year, next decade;
        sleep, sleep, die yourself,
        die at last
# Larry Wall

Looking further, someone had long since become the poet immortal of C. It compiles, it runs, and it gets more fun the more you read it!

    double time, me= !0XFACE,
    not; int rested,   get, out;
    main(ly, die) char ly, **die ;{
        signed char lotte,
dear; (char)lotte--;
    for(get= !me;; not){
    1s -  out & out ;lie;{
    char lotte, my= dear,
    **let= !!me *!not+ ++die;
"The gloves are OFF this time, I detest you, snot\n\0sed GEEK!");
    do {not= *lie++ & 0xF00L* !me;
    #define love (char*)lie -
    love 1s *!(not= atoi(let
    [get -me?
(char)lotte: my- *love -
    'I'  -  *love -  'U' -
    'I'  -  (long)  - 4 - 'U' ])- !!
    (time  =out=  'a'));} while( my - dear
    && 'I'-1l  -get-  'a'); break;}}
(char)*lie++, (char)*lie++; hell:0, (char)*lie;
    get *out* (short)ly   -0-'R'-  get- 'a'^rested;
    do {auto*eroticism,
    that; puts(*( out
        - 'c'
-('P'-'S') +die+ -2 ));}while(!"you're at it");
for (*((char*)&lotte)^=
    (char)lotte; (love ly) [(char)++lotte+
    !!0xBABE];){ if ('I' -lie[ 2 +(char)lotte]){ 'I'-1l ***die; }
    else{ if ('I' * get *out* ('I'-1l **die[ 2 ])) *((char*)&lotte) -=
    '4' - ('I'-1l); not; for(get=!
get; !out; (char)*lie  &  0xD0- !not) return!!
    do{ not* putchar(lie [out
    *!not* !!me +(char)lotte]);
    not; for(;!'a';);}while(
        love (char*)lie);{
register this; switch( (char)lie
    [(char)lotte] -1s *!out) {
    char*les, get= 0xFF, my; case' ':
    *((char*)&lotte) += 15; !not +(char)*lie*'s';
    this +1s+ not; default: 0xF +(char*)lie;}}}
    get - !out;
    if (not--)
    goto hell;
        exit( (char)lotte);}