The features I have seen so far in C to Verilog are listed below.
@ Pre-Synthesis part
1. Parallel support
Ref :
parallel synthesis @ llvm
2. reduce redundant nodes
ex: original
a = 3+4;
b = b+a;
ex: new
// remove the instruction "a = 3+4" and fold its constant result into the next instruction
b = b+7;
3. reduce redundant bit width
ex: original
int32_t a,c;
a =3;
c = 32'b(a)+32'b(1);
ex: new
int8_t a;
int9_t c;
a =3;
c = 8'b(a)+8'b(1);
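The width arithmetic behind this example can be sketched in C. This is only an illustration, not the tool's actual pass: the helper names `min_bits` and `add_result_bits` are hypothetical, and only unsigned values are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of bits that can hold the unsigned value v (at least 1). */
static unsigned min_bits(uint32_t v) {
    unsigned bits = 1;
    while (v >>= 1) {
        bits++;
    }
    return bits;
}

/* Width needed for the sum of two unsigned operands:
   one carry bit on top of the wider operand. */
static unsigned add_result_bits(unsigned lhs_bits, unsigned rhs_bits) {
    assert(lhs_bits > 0 && rhs_bits > 0);
    unsigned wider = lhs_bits > rhs_bits ? lhs_bits : rhs_bits;
    return wider + 1;
}
```

With two 8-bit operands, `add_result_bits(8, 8)` gives the 9 bits used for `c` above.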
4. remove Allocation
ex: original
void A(){
...
B (a,b,&c);
}
void B(int a,int b, int *c){
*c = a+b;
}
ex: new
void A(){
...
c = a+b;
}
5. instruction priority
c=a+b; // instruction 1
d=c+1; // instruction 2 (uses c from instruction 1)
priority: 1 > 2, so instruction 1 is scheduled first
...
6. Ignore PHI nodes.
Constraint: when a basic block finishes, its data has already been computed.
Note: the PHI node's incoming (live-in) values have therefore already been produced in the predecessor basic blocks.
ex:
%i.08 = phi i32 [ 0, %entry ], [ %inc, %popCnt.exit ]
%arrayidx = getelementptr i32* %B, i32 %i.08
...
// 0 is assigned in basic block entry
// %inc is computed in basic block popCnt.exit and is replaced by %i.08
7. reduce redundant instructions
@ PHI part
%i.06.i = phi i32 [ 0, %for.body ], [ %inc.i, %for.body.i ]
%i.07.i = phi i32 [ 0, %for.body ], [ %inc.i, %for.body.i ]
// %i.07.i duplicates %i.06.i, so remove it
@ Operand
%add.01.i = add i32 %and.i, %sum.07.i
%add.02.i = add i32 %and.i, %sum.07.i
// %add.02.i duplicates %add.01.i, so remove it
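A minimal sketch of how such duplicates could be detected, assuming a simplified instruction record. The `Inst` struct and `find_duplicate` are hypothetical and not part of LLVM, and the comparison is only safe for side-effect-free instructions such as `add`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record: result = opcode lhs, rhs */
typedef struct {
    const char *result;
    const char *opcode;
    const char *lhs;
    const char *rhs;
} Inst;

/* The two add instructions from the example above. */
static const Inst example[] = {
    { "%add.01.i", "add", "%and.i", "%sum.07.i" },
    { "%add.02.i", "add", "%and.i", "%sum.07.i" },
};

/* Index of an earlier instruction with the same opcode and operands
   as insts[i], or -1 when insts[i] is not redundant. */
static int find_duplicate(const Inst *insts, size_t i) {
    assert(insts != NULL);
    for (size_t j = 0; j < i; j++) {
        if (strcmp(insts[j].opcode, insts[i].opcode) == 0 &&
            strcmp(insts[j].lhs, insts[i].lhs) == 0 &&
            strcmp(insts[j].rhs, insts[i].rhs) == 0) {
            return (int)j;
        }
    }
    return -1;
}
```

When `find_duplicate` returns an index, the later instruction can be dropped and its uses redirected to the earlier result.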
8. loop
Unroll the loop under a hardware resource constraint of 2 adders.
ex : original
for(int i=0; i<10; i++)
a[i] = b[i]+1;
...
ex : new
for(int i=0; i<10; i=i+2){
a[i] = b[i]+1;
a[i+1] = b[i+1]+1;
}
...
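As a quick check of the unrolling step, the following C sketch compares the rolled and unrolled loops on the same input. The function names, `TRIP_COUNT`, and `sample_b` are made up for illustration, and the trip count is assumed to be even so no remainder loop is needed.

```c
#include <assert.h>

enum { TRIP_COUNT = 10 };

static const int sample_b[TRIP_COUNT] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

/* Original loop: one addition per iteration. */
static void add_one_rolled(int *a, const int *b) {
    for (int i = 0; i < TRIP_COUNT; i++) {
        a[i] = b[i] + 1;
    }
}

/* Unrolled by 2: two independent additions per iteration, one per adder. */
static void add_one_unrolled2(int *a, const int *b) {
    assert(TRIP_COUNT % 2 == 0);  /* even trip count, no remainder loop */
    for (int i = 0; i < TRIP_COUNT; i += 2) {
        a[i]     = b[i] + 1;
        a[i + 1] = b[i + 1] + 1;
    }
}

/* Returns 1 when both versions produce identical results for input b. */
static int unroll_matches(const int *b) {
    int rolled[TRIP_COUNT], unrolled[TRIP_COUNT];
    add_one_rolled(rolled, b);
    add_one_unrolled2(unrolled, b);
    for (int i = 0; i < TRIP_COUNT; i++) {
        if (rolled[i] != unrolled[i]) {
            return 0;
        }
    }
    return 1;
}
```

The two additions in the unrolled body do not depend on each other, which is what lets them map onto the two available adders in the same cycle.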
@ Synthesis part
1. Global Value
A Verilog module does not support local values, so every value is declared at module level.
ex : @ Verilog
module xx(...);
input ...
output ...
reg ...
wire ...
...
always@(posedge clk)begin
...
end
...
endmodule
2. Schedule list
External memory support.
Split each memory load instruction into two basic blocks:
a memory request block (drives address/mode) and a memory read block (waits for the requested data).
Note: each basic block must complete in one clock cycle.
ex: original
for.body:
// @ address/mode phase
%arrayidx = getelementptr i32* %B, i32 %i.08
// @ data phase
%tmp3 = load i32* %arrayidx, align 4, !tbaa !0
...
ex: new
for.body.phase1:
%arrayidx = getelementptr i32* %B, i32 %i.08
br label %for.body.phase2
for.body.phase2:
%tmp3 = load i32* %arrayidx, align 4, !tbaa !0
@ Verilog view
module xx(...);
output [31:0] mem_address;
output [0 :0] mem_mode;
output [31:0] mem_store_data;
input [31:0] mem_load_data;
reg [3 :0] cur_state,nxt_state;
reg [31:0] a;
....
case(cur_state)
for_body_phase1 : begin
mem_address = 32'h00000000;
mem_mode = read;
nxt_state = for_body_phase2;
end
for_body_phase2 : begin
a = mem_load_data;
nxt_state = alu_phase;
end
alu_phase :
...
...
endcase
...
cur_state = nxt_state;
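The two-phase load can also be modelled cycle by cycle in C, which may help when checking the schedule before writing Verilog. This is a rough software model only: `run_load`, `LoadResult`, and `sample_memory` are hypothetical names, and the external memory is assumed to return data exactly one cycle after the request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FOR_BODY_PHASE1, FOR_BODY_PHASE2, ALU_PHASE } State;

typedef struct {
    uint32_t value;   /* data captured in phase 2 */
    int      cycles;  /* clock cycles spent before alu_phase */
} LoadResult;

/* Hypothetical word-addressed external memory. */
static const uint32_t sample_memory[4] = { 10, 20, 30, 40 };

/* Runs one load through the state machine, one loop pass per clock. */
static LoadResult run_load(const uint32_t *memory, uint32_t address) {
    assert(memory != NULL);
    State cur_state = FOR_BODY_PHASE1;
    uint32_t mem_address = 0;
    LoadResult r = { 0, 0 };
    while (cur_state != ALU_PHASE) {
        State nxt_state = cur_state;
        switch (cur_state) {
        case FOR_BODY_PHASE1:           /* request: drive the address */
            mem_address = address;
            nxt_state = FOR_BODY_PHASE2;
            break;
        case FOR_BODY_PHASE2:           /* read: capture the returned data */
            r.value = memory[mem_address];
            nxt_state = ALU_PHASE;
            break;
        case ALU_PHASE:
            break;
        }
        cur_state = nxt_state;          /* state register update */
        r.cycles++;
    }
    return r;
}
```

In this model a load always costs two cycles, one per basic block, which matches the one-block-per-cycle rule stated above.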
3. pipeline support
Unroll the loop, then insert the pipeline.
@ example case
ex: original
%cat test.c
static inline unsigned int popCnt(unsigned int input) {
unsigned int sum = 0;
for (int i = 0; i < 32; i++) {
sum += (input) & 1;
input = input/2;
}
return sum;
}
How to unroll the loop:
// step 1. generate the LLVM IR bitcode
%clang -O3 -emit-llvm test.c -c -o test.bc
// step 2. unroll the loop in the bitcode
// you can list the loop options with "opt -help | grep loop"
%opt -loop-unroll -unroll-count 20 test.bc -debug | llvm-dis
// step 3. if step 2 succeeds, write out the new bitcode
%opt -loop-unroll -unroll-count 20 test.bc > opt_test.bc
View the un-pipelined IR:
%llvm-dis opt_test.bc > opt_test.ll
%cat opt_test.ll
for.body.i.1: ...Instruction A br label %for.body.i.2
for.body.i.2: ...Instruction B br label %for.body.i.3
for.body.i.3: ...Instruction C br label %for.body.i.4
for.body.i.4: ...Instruction D br label %for.body.i.5
...
Pipelined version:
for.body.i.1 : ...Instruction A br label %for.body.i.2
for.body.i.2 : ...Instruction B,A br label %for.body.i.3
for.body.i.3 : ...Instruction C,B,A br label %for.body.i.4
for.body.i.4 : ...Instruction D,C,B,A br label %for.body.i.5
for.body.i.5 : ...Instruction D,C,B br label %for.body.i.6
for.body.i.6 : ...Instruction D,C br label %for.body.i.7
for.body.i.7 : ...Instruction D br label %for.exit.i.0
for.exit.i.0 :
...
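The fill and drain pattern in the pipelined listing can be reproduced with a small C sketch. It assumes the simplest possible schedule, where stage s of iteration k runs in cycle k + s (all zero-based) and each stage takes one cycle; the function names are hypothetical.

```c
#include <assert.h>

/* Stage s of iteration k runs in cycle k + s (all zero-based). */
static int stage_active(int cycle, int stage, int iterations) {
    assert(iterations > 0 && stage >= 0 && cycle >= 0);
    int iteration = cycle - stage;
    return iteration >= 0 && iteration < iterations;
}

/* Number of stages busy in a given cycle. */
static int active_stage_count(int cycle, int stages, int iterations) {
    int count = 0;
    for (int s = 0; s < stages; s++) {
        count += stage_active(cycle, s, iterations);
    }
    return count;
}

/* Total cycles: pipeline fill, steady state, and drain. */
static int pipelined_cycles(int stages, int iterations) {
    assert(stages > 0 && iterations > 0);
    return iterations + stages - 1;
}
```

For the four instructions A to D over four iterations, this gives 1, 2, 3, 4, 3, 2, 1 busy stages across 7 cycles, the same shape as the listing above.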