learning plus: parallel synthesis @ llvm

Sunday, March 20, 2011

parallel synthesis @ llvm

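The instructions we see in machine code are laid out cycle by cycle, but on real hardware they can run in parallel across multiple hardware resources. In Verilog, for example, the parentheses '(' ')' group operations and so decide how hardware gets assigned. The example below comes from C 2 Verilog.

ps: Dependencies between RHS and LHS are ignored here. Only instructions of the same ALU type are considered, and only runs of at least 3 consecutive ALU instructions.

Assume a sample hardware IP named test. ex: sample C code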

void test(int a,int b, int c, int d, int e, int *o){
     *o = a+b+c+d+e;
}

After compiling with GCC, the machine code is visibly in cycle-by-cycle form: every addl accumulates into %eax, so each one waits on the one before it.

test:
        movl    8(%esp), %eax
        addl    4(%esp), %eax
        addl    12(%esp), %eax
        addl    16(%esp), %eax
        addl    20(%esp), %eax
        movl    24(%esp), %ecx
        movl    %eax, (%ecx)
        ret

The same holds in LLVM IR.

define void @test(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32* nocapture %o) nounwind {
entry:
  %add = add i32 %b, %a
  %add3 = add i32 %add, %c
  %add5 = add i32 %add3, %d
  %add7 = add i32 %add5, %e
  store i32 %add7, i32* %o, align 4, !tbaa !0
  ret void
}
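The restriction stated earlier (same ALU type, at least 3 consecutive ALU instructions) could be checked with a scan along the following lines. This is only a minimal sketch: the `opcode` enum, `MIN_RUN`, and `find_alu_run` are hypothetical names invented for illustration and are not part of LLVM or C 2 Verilog. Like the post, it looks only at opcodes and ignores operand dependencies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode tags for this sketch; this is not LLVM's own enum. */
enum opcode { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_RET };

/* Minimum run length taken from the post: at least 3 consecutive ALU ops. */
#define MIN_RUN 3

/* Returns nonzero for the opcodes this sketch treats as ALU operations. */
static int is_alu(enum opcode op) {
    return op == OP_ADD || op == OP_MUL;
}

/*
 * Finds the first run of at least MIN_RUN consecutive instructions that
 * share one ALU opcode. On success, stores the index of the first
 * instruction in *start and returns the run length; returns 0 otherwise.
 */
size_t find_alu_run(const enum opcode *ops, size_t n, size_t *start) {
    assert(ops != NULL && start != NULL);
    size_t i = 0;
    while (i < n) {
        size_t j = i + 1;
        /* Extend the run while the opcode stays the same. */
        while (j < n && ops[j] == ops[i]) {
            j++;
        }
        if (is_alu(ops[i]) && j - i >= MIN_RUN) {
            *start = i;
            return j - i;
        }
        i = j;
    }
    return 0;
}
```

Fed the body of @test above (four adds, a store, a ret), the scan would report a run of four adds starting at index 0, which makes the function a candidate for the transformation.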

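Drawn as a graph, it looks roughly like this: a chain of four adds, each one consuming the result of the previous one.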

From the hardware point of view, however, if we ignore hardware resource constraints, it can look like this instead, with a+b and c+d computed side by side:

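Here only 3 clock cycles and two ALU IP adders (ADDs) are needed, instead of the four dependent adds of the chain. With that idea in place, the real goal is to reshape the LLVM IR so that it matches our assumption. In the rewritten IR the parallelism becomes visible: %lowlevel and %lowlevel1 do not depend on each other.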

define void @test(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32* nocapture %o) nounwind {
entry:
  %lowlevel = add i32 %b, %a
  %lowlevel1 = add i32 %c, %d
  %lowlevelOdd = add i32 %e, %lowlevel1
  %headNode = add i32 %lowlevel, %lowlevelOdd
  store i32 %headNode, i32* %o, align 4, !tbaa !0
  ret void
}
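The cycle counts can be checked by hand with an as-soon-as-possible schedule. The sketch below is an illustration only, under assumptions that are mine rather than the post's tooling: every add takes exactly one cycle, the function arguments are ready at cycle 0, and there is no limit on the number of adders. The names `add_node`, `asap_cycles`, `chain`, and `tree` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define INPUT (-1)   /* operand is a function argument, ready at cycle 0 */
#define MAX_NODES 16 /* arbitrary bound for this sketch */

/* One two-input add; lhs/rhs are indices of EARLIER nodes, or INPUT. */
struct add_node { int lhs; int rhs; };

/*
 * ASAP schedule: a node runs one cycle after its latest operand is ready.
 * Nodes must be listed in dependency order. Returns the total number of
 * cycles and stores the largest number of adds sharing a cycle (that is,
 * the number of adders needed) in *peak_adders.
 */
int asap_cycles(const struct add_node *nodes, int n, int *peak_adders) {
    assert(n > 0 && n <= MAX_NODES && peak_adders != NULL);
    int level[MAX_NODES];
    int per_cycle[MAX_NODES + 1] = {0};
    int cycles = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        assert(nodes[i].lhs < i && nodes[i].rhs < i);
        int l = nodes[i].lhs == INPUT ? 0 : level[nodes[i].lhs];
        int r = nodes[i].rhs == INPUT ? 0 : level[nodes[i].rhs];
        level[i] = 1 + (l > r ? l : r);
        per_cycle[level[i]]++;
        if (level[i] > cycles) cycles = level[i];
        if (per_cycle[level[i]] > peak) peak = per_cycle[level[i]];
    }
    *peak_adders = peak;
    return cycles;
}

/* %add, %add3, %add5, %add7 from the original IR. */
const struct add_node chain[4] = {
    { INPUT, INPUT }, { 0, INPUT }, { 1, INPUT }, { 2, INPUT }
};

/* %lowlevel, %lowlevel1, %lowlevelOdd, %headNode from the rewritten IR. */
const struct add_node tree[4] = {
    { INPUT, INPUT }, { INPUT, INPUT }, { INPUT, 1 }, { 0, 2 }
};
```

Under these assumptions `chain` needs 4 cycles with a single adder, while `tree` needs 3 cycles with two adders busy in the first cycle, which agrees with the numbers given in the text.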

ps: In Verilog you can actually write (a+b)+(c+d)+e to get the parallel form out of synthesis.
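Regrouping is only safe if it never changes the value. The LLVM add in the IR above carries no nsw/nuw flags, so it wraps on overflow and can be regrouped freely. A rough C equivalent, written as a sketch with hypothetical function names, uses uint32_t because unsigned arithmetic wraps in C, whereas signed int overflow is undefined behavior:

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right chain, the shape of the original IR. */
uint32_t sum_chain(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e) {
    return (((a + b) + c) + d) + e;
}

/* Grouping from the ps above: (a+b) and (c+d) do not depend on each other. */
uint32_t sum_grouped(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e) {
    return ((a + b) + (c + d)) + e;
}

/* Small self-check: both groupings agree, including when the sum wraps. */
void check_grouping(void) {
    assert(sum_chain(1, 2, 3, 4, 5) == sum_grouped(1, 2, 3, 4, 5));
    assert(sum_chain(UINT32_MAX, 2, 3, 4, 5) ==
           sum_grouped(UINT32_MAX, 2, 3, 4, 5));
}
```

This only shows that the two groupings produce the same value. A C compiler is free to reorder the adds either way, so the grouping matters for the IR rewrite and the Verilog source, not for the C text itself.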