Image
image
image
image


Design Articles

Fundamentals of High Level Synthesis—Part 4

This article is excerpted from Chapter 4: Fundamentals of High Level Synthesis from the newly published book High Level Synthesis Blue Book by Michael Fingerhoff (Copyright 2010 by Mentor Graphics Corporation). This is the fourth and final installment. To learn more about high-level synthesis you can order the book at Amazon.com.

By Michael Fingeroff, Mentor Graphics

Pipeline Feedback

The initiation interval can be set anywhere from a synthesis tool dependent maximum down to an II=1 on any feed-forward design. However, a design with feedback limits the initiation interval to be no less than the delay of the feedback path. There are three types of feedback, data dependent, control dependent, and inter-block feedback. Inter-block feedback is discussed in later chapters covering system level design.

Data Feedback

Data feedback occurs when the input to a data path operation is dependent on a variable computed in the previous loop iteration. If the only loop in the design is the main loop the variable must have been declared as static for there to be feedback. Consider the following design:

Example 4-25. Data Feedback Design

   1  void accumulate(int a, int b, int &dout){
   2    static int acc=0;
   3    int  tmp = acc*a;
   4    acc = tmp+b;
   5    dout = acc;
   6  }

Design Constraints
Clock frequency slow
Main loop pipelined with II=1

If the clock frequency for this design is assumed to be very slow the schedule and hardware would look approximately like Figures 4-40 and Figure 4-41. The design schedule shows that pipelining with II=1 is possible since each iteration of the main loop finishes computing “acc” before the next iteration starts. This is also obvious by looking at the hardware diagram.

Figure 4-40. Feedback Within One Clock Cycle



Figure 4-41. Hardware for Feedback Within One Clock Cycle

Now consider the same design from Example 4-25 re-synthesized with the following constraints:

Design Constraints
Clock frequency very fast
Main loop pipelined with II=1
Multiplier constrained to a two-cycle pipelined multiplier

This design cannot be pipelined with II=1 with the given set of constraints listed above. The failed schedule shown in Figure 4-42 illustrates why. To pipeline with II=1 would mean that “acc” is available to be read in the second clock cycle. However, the first pipeline stage is not finished computing “acc” until the edge of the third clock cycle. Another way to look at this is to examine the hardware that is synthesized, shown in Figure 4-43. It takes two clock cycles to compute “tmp” in the feed-forward path. “tmp” is then added to the current value of “b” and fed back to the multiplier. Lines 3 and 4 of Example 4-25 show that each time a new value of “acc” is computed it is available in the next iteration to compute “tmp”. Thus the hardware pipeline cannot be made to run every clock cycle since it must allow the multiplier to flush for each computation of “tmp”. The best possible performance would be pipelining with II=2.

Figure 4-42. Failed Schedule for Multi-cycle Feedback

Figure 4-43. Hardware for Multi-cycle Feedback

The solution to getting the design discussed above to pipeline with II=1 is to modify the design to balance the delays between the feed-forward and feedback paths. This means introducing delay elements in the C++ along the feedback path. The functionality is different from the original design, but there is no other way to pipeline with II=1 and have the RTL match the C++ exactly. Example 4-26 shows Example 4-25 rewritten to balance the delay along the feedback path to match the two cycle feed forward delay. This is done by creating a two element shift register to delay “acc”. The hardware synthesized for Example 4-26 is shown in Figure 4-44.

In general the number of shift register elements needed in the feedback path can be computed as:

Num Shift Elements = (feed-forward latency)/Initiation Interval (II)

Example 4-26. Balancing Feedback Path Delays

   1  void accumulate(int a, int b, int &dout){
   2    static int acc=0;
   3    static int acc_old0;
   4    static int acc_old1;
   5 
   6    int  tmp0 = acc_old1*a;
   7    acc = tmp0+b;
   8    acc_old1 = acc_old0;
   9    acc_old0 = acc;
   10   dout = acc;
   11 }

The details of Example 4-26 are:

  • Lines 3 and 4 define two static variables used to implement the feedback delays.

  • Line 6 uses the delayed feedback “acc_old1” as the input to the multiplier.

  • Lines 8 and 9 implement the shift register to delay “acc” by two clock cycles.

Figure 4-44. Hardware with Balanced Delays on Feedback Path

Control Feedback

Pipelining failures due to feedback are also possible due to the loop control in a design. The deeper the nesting of loops in a design, the more complicated the control becomes, which in turn limits the clock frequency and ability to pipeline a design. Adhering to the recommended coding practices eliminates many of these potential issues. The following design, Example 4-27, illustrates how “bad” coding style can lead to problems when trying to pipeline. This design does not only have larger area than needed, but also fails pipelining for high clock frequencies due to control feedback. The cause of this is due to the 32-bit interface variables being used for the loop upper bounds. The impact of writing the C++ this way was covered in detail in “Optimizing the Loop Counter” on page 54 and “Optimizing the Loop Control” on page 55. Essentially there is a long combinational path created to evaluate the loop exit conditions. The outer loop “X” has to know when the inner loops are finished so it can exit immediately.

Figure 4-45 shows the approximate hardware structure for Example 4-27. Although this is a very rough approximation it clearly shows that there is a combinational path through both 32-bit loop bounds comparisons, which severely impacts performance as the clock frequency is increased. A secondary problem is that the unbounded loops generate 32-bit logic for the loop counters. This can also prevent pipelining due to the feedback on the loop accumulator.

Example 4-27. Control Feedback

   1  void control(int din[8][8],
   2               int dout[8],
   3               int x_size,
   4               int y_size){
   5    int acc;
   6  X:for(int x=0;x<x_size;x++){
   7      acc = 0;
   8    Y:for(int y=0;y<y_size;y++){
   9        acc += din[x][y];
   10       dout[x] = acc;
   11     }
   12   }
   13 }

Figure 4-45. Control Feedback

To minimize the possibility of feedback failures, Example 4-27 should be rewritten using the recommended style discussed previously. This is shown below in Example 4-28. The loops have been bounded, and the control logic for the loop exit reduced by using the appropriate bit widths on “x_size” and “y_size”.

Example 4-28. Minimizing Control Feedback

   1  #include <ac_int.h>
   2  void control(int din[8][8],
   3               int dout[8],
   4               ac_int<4,false> x_size,
   5               ac_int<4,false> y_size){
   6    int acc;
   7  X:for(int x=0;x<8;x++){
   8      acc = 0;
   9    Y:for(int y=0;y<8;y++){
   10       acc += din[x][y];
   11       dout[x] = acc;
   12       if(y==y_size-1)
   13         break;
   14     }
   15     if(x==x_size-1)
   16       break;
   17   }
   18 }

Conditions

Sharing

HLS can automatically share resources when it can prove mutual exclusivity. This means that HLS can theoretically share any similar operators that are in mutually exclusive branches of a condition, no matter how deeply nested the condition. The reality is that there are a number of ways that the C++ can be written and/or constrained so that the proof of mutual exclusivity is not possible. Usually this is due to a combination of either bad coding style or overly complex or deeply nested conditions. Good coding practices should always allow the maximum amount of sharing.

Conditional expressions are specified using the switch-case and if-else statements.

if-else statement

The if-else statement has the following form:

if( condition0 ) {
      statement-list0;
}
else if( condition1 ) {
      statement-list1;
}
...
else {
      statement-listN;
}

The conditions evaluate to a boolean expression and can range from simple boolean conditions to complex function calls. The statement list can be any number of C++ assignments, conditional expressions, or function calls.

switch statement

The switch statement has the following form:

switch( expression ) {
    case 0: statement list0;
    break;
    case 1: statement list1;
    break;
    ...
    case N: statement listN;
    break;
    default: statement list;
    break;
}

The “expression” is typically an integer that selects one of the possible cases. The statement list can be any number of C++ assignments, conditional expressions, or function calls. The statement list for a selected case executes and is followed by a break.
 

NOTE: Although it is possible to have a “case” without a “break” this is not generally good for synthesizable C++. The behavior in C++ is to drop through to the next “case”. However in C++ synthesis this can sometimes cause replication of logic.
 

Keep it Simple

Think about what you want the hardware to do and code your design using good design practices. While it is easy to write complex deeply nested conditions and rely on the HLS tool to share everything, it is just as likely to get less sharing than expected. Consider the following design example (Example 4-29) that conditionally accumulates one of four different arrays based on several IO variables. Each condition branch calls the “acc” function with one of four arrays as the input.

Example 4-29. Automatic Sharing and Nested Conditions

   1  int acc(int data[4]){
   2    int tmp = 0;   
   3    ACC:for(int i=0;i<4;i++)
   4      tmp += data[i];
   5    return tmp;
   6  }
   7  void test(int a[4], int b[4], int c[4], int d[4],
   8            bool sel0, bool sel1, bool sel2, int &dout){
   9    int tmp; 
   10   if(sel0){
   11     if(sel1)
   12       tmp = acc(a);
   13     else if(sel2)
   14       tmp = acc(b);
   15     else
   16       tmp = acc(c);  
   17   }else
   18     tmp = acc(d);
   19   dout = tmp;
   20 }

Design Constraints
Clock frequency slow
Main loop pipelined with II=1
All loops unrolled

There are several potential problems with the design in Example 4-29.

  1. The four calls to the “acc” function are by default all inlined during synthesis. This means that there are four copies of the “ACC” loop that are inlined and optimized. Although it is possible that HLS can still share everything this will in general lead to longer synthesis runtimes since all four copies must be merged back together and shared. One possible solution to improve sharing and runtime would be to make “acc” into a component using a HLS component flow.

  2. Even if everything is shared it is likely that HLS will perform fine-grained sharing, which leads to more MUX logic since each individual operator is shared separately. One possible solution to minimize MUX logic would be to make “acc” into a component.

  3. The conditions in this example are simple and the clock frequency is slow enough so that everything is scheduled in the same clock cycle. As the conditions become more complex, the nesting becomes deeper, and/or the clock frequency increases, it is likely that operators will be scheduled in different clock cycles. This can limit sharing. Making “acc” into a component will not help in these types of situations. The best solution is to rewrite the code so that “acc” is called once.

Example 4-30 shows Example 4-29 rewritten to facilitate sharing. The key is to use the conditions to compute the MUXing of data and control and call the function only once.

Example 4-30. Explicit Sharing and Nested Conditions

   1  int acc(int data[4]){
   2    int tmp = 0;   
   3    ACC:for(int i=0;i<4;i++)
   4      tmp += data[i];
   5    return tmp;
   6  }
   7  void test(int a[4], int b[4], int c[4], int d[4],
   8            bool sel0, bool sel1, bool sel2, int &dout){
   9    int tmp,data[4]; 
   10   for(int i=0;i<4;i++) 
   11     if(sel0){
   12       if(sel1)
   13         data[i] = a[i];
   14       else if(sel2)
   15         data[i] = b[i];
   16       else
   17         data[i] = c[i];
   18     }else
   19       data[i] = d[i];
   20   tmp = acc(data);
   21   dout = tmp;
   22 }

Design Constraints
Clock frequency slow
Main loop pipelined with II=1
All loops unrolled

Example 4-30 will in general give better results than Example 4-29. This is because the nested conditional expression is only used to control the selection of the input array. Once the input array is selected the “acc” function is called once. Doing this allows HLS to easily optimize the adder tree for the “acc” function and the MUX logic is only needed to select the input data. In essence this is coarse grained sharing. This style should be used when using component flows does not give the desired sharing. This example will also have better runtime in general since the “ACC” loop is only inlined once.

Functions and Multiple Conditional Returns

Although multiple returns in function calls are allowed by both C++ and HLS, they are in general a bad idea. This is true both from a code debugging perspective as well as a synthesis quality of results issue. HLS balances the pipeline stages of all conditional branches. Having a return in the branch complicates this and makes it more difficult to pipeline a design. It is best to use a single return at the end of the function. It’s especially bad to use multiple returns to try and make things mutually exclusive.

Consider the following example:

Example 4-31. Multiple Conditional Returns

   1  int acc(int data[4]){
   2    int tmp = 0;   
   3  ACC:for(int i=0;i<4;i++)
   4      tmp += data[i];
   5    return tmp;
   6  }
   7  int test(int a[4], int b[4], bool sel0, bool sel1){ 
   8    if(sel0){
   9      return acc(a);
   10   }
   11   if(sel1){
   12     return acc(b);
   13   }
   14 }

Example 4-31 has several problems that will prevent good QofR.

  1. Although sel0 and sel1 are mutually exclusive in that they are never evaluated together, HLS will typically not be able to prove this and will not share “acc”.

  2. The function returns on both lines 9 and 12. If HLS cannot prove mutual exclusivity it will not be able to pipeline with II=1 if the function return is mapped to an IO.

  3. The function only returns if “sel0” or “sel1” is true. This means that the return value can be undefined. This undefined behavior may cause logic to be optimized away or simulation behavior of the RTL may not match the C++.

Example 4-31 is rewritten to have only one return, shown below.

Example 4-32. Single Function Return

   1  int acc(int data[4]){
   2    int tmp = 0;   
   3  ACC:for(int i=0;i<4;i++)
   4      tmp += data[i];
   5    return tmp;
   6  }
   7  int test(int a[4], int b[4], bool sel0, bool sel1){
   8    int tmp = 0; 
   9    if(sel0){
   10     tmp = acc(a);
   11   }
   12   else if(sel1){
   13     tmp = acc(b);
   14   }
   15   return tmp;
   16 }

Example 4-32 has made the conditions mutually exclusive. A temporary variable “tmp” is used to store the result of each condition and is then returned on line 15. The temporary variable is initialized to zero as well so the return value will never be undefined.

Replacing Conditional Returns with Flags

It is also possible to bypass entire sections of code by using a conditional return. This should also be avoided. It is always possible to replace the conditional return with a flag variable that can bypass the code. Consider the following code fragment:

Example 4-33. Conditional Return to Bypass Code

   1  ...
   2  tmp = 0;
   3  tmp += a;
   4  if(sel0)
   5      return tmp;
   6  tmp += b;
   7  return tmp;

Example 4-33 only adds “b” to “tmp” if “sel0” is false. It should be rewritten as:

Example 4-34. Using Flags to Bypass Code

   1  ...
   2  bool flagl = false;
   3  tmp = 0;
   4  tmp += a;
   5  if(sel0)
   6      flag = true;
   7  if(flag)
   8      tmp += b;
   9  return tmp;

Example 4-34 replaces the conditional return with a flag that is set conditionally. The flag is then used to conditionally bypass the same sections of code that were bypassed by the conditional return. A single return is used at the end of the function.
 

NOTE: A function should have one and only one return.
 

References

1.   John P. Elliot, Understanding Behavioral Synthesis, Kluwer Academic Publishers 1999.


Bookmark and Share


Insert your comment

Author Name(required):

Author Web Site:

Author email address(required):

Comment(required):

Please Introduce Secure Code:


image
image