Design Articles

Fundamentals of High Level Synthesis—Part 3

This article is excerpted from Chapter 4: Fundamentals of High Level Synthesis from the newly published book High Level Synthesis Blue Book by Michael Fingeroff (Copyright 2010 by Mentor Graphics Corporation). This is the third of four weekly installments. If you can't wait, or want to learn more, the book is available at Amazon.com.

By Michael Fingeroff, Mentor Graphics

Nested Loops

Nested loops, and the effects of pipelining them, are among the most misunderstood concepts in high-level C++ synthesis. Understanding the hardware behavior that results from synthesizing non-pipelined and pipelined nested loops allows designers to meet performance and area requirements more easily. The simple accumulator that has been used in previous examples can be extended to illustrate the effects of nested loops.

Example 4-14. Nested Loop Accumulator

   1  #include "accum.h"
   2  #include <ac_int.h>
   3  #define MAX 100000
   4  void accumulate(int din[2][4], int &dout){
   5    int acc=0;
   6  ROW:for(int i=0;i<2;i++){
   7      if(acc>MAX)  
   8        acc = MAX;
   9    COL:for(int j=0;j<4;j++){
   10       acc += din[i][j];
   11     }
   12   }
   13   dout = acc;
   14 }

The following enhancements were made to the C++ accumulator design:

1.   Line 4 - The input data is a 2x4 array of type integer.
2.   Lines 6 and 9 - Two loops, ROW and COL, are nested to index the rows and columns of the 2x4 array.
3.   Lines 7 and 8 - The accumulator variable “acc” is saturated at the beginning of each ROW loop iteration to keep it from exceeding a maximum value. This is somewhat of an artificial example, but it helps illustrate the effects of nesting loops.

Unconstrained Nested Loops

If the nested loop accumulator in Example 4-14 is synthesized with both loops left rolled and no loop pipelining, the resulting hardware has a behavior similar to the state diagram shown in Figure 4-21. Note that for this example it is assumed that each iteration of the COL loop can execute in one clock cycle.

Figure 4-21. State Diagram of Unconstrained Nested Loops

The state diagram in Figure 4-21 shows that unconstrained nested loops have an overhead associated with the computation of the outer loop body and index (C1_ROW and C2_ROW). This overhead increases the latency of the design, and the increase can be substantial compared to the number of clock cycles required to perform the main computation of the algorithm. Figure 4-21 shows that executing the design requires two c-steps (clock cycles) for the main loop and two c-steps for each of the two ROW loop iterations, in addition to the eight cycles required to execute the four-iteration COL loop twice. The entire design therefore takes 14 cycles (2 + 4 + 8) to accumulate the eight values of din[2][4], which means that 43% of the execution time (6 of the 14 cycles) is taken up by the main and ROW loop overhead in this example. Figure 4-22 shows the schedule for the unconstrained design.

Figure 4-22. Schedule of Unconstrained Nested Loops


NOTE: Unconstrained nested loops can increase latency because of the overhead of computing the loop exit conditions and loop bodies separately.

Pipelined Nested Loops

Loop pipelining can be applied in order to improve design performance.

NOTE: It is generally good practice to begin pipelining with the innermost loops and work up towards the outermost loops. This usually gives the best area/performance trade-off.

For this example the innermost loop is the COL loop. However, since it was assumed that each iteration of the COL loop requires only one c-step to execute, there is no benefit in pipelining this loop.

Pipelined ROW Loop with II=1

NOTE: When nested loops are pipelined together the loops are “flattened” into a single loop. The initiation interval constraint is then applied to the flattened loop.
Figure 4-23 shows the state diagram that illustrates the effect of pipelining the ROW and COL loops for Example 4-14.

Figure 4-23. State Diagram of ROW Loop with II=1

Figure 4-23 shows how loop flattening removes the overhead of the C1_ROW and C2_ROW states by combining the saturation and ROW loop index logic into the same loop with the COL loop. Although pipelining nested loops improves performance in terms of latency and throughput, it is not without cost. The control logic becomes progressively more complex as more and more nested loops are pipelined. This can lead to larger area, or failure to schedule in some cases. So a good rule of thumb is to start pipelining the inner loops and work your way towards the outer loops until the performance target is met. Figure 4-24 shows the schedule when the ROW loop is pipelined with II=1.

Figure 4-24. Schedule of ROW Loop with II=1

Figure 4-24 shows that by pipelining the ROW and COL loops together, the two cycle overhead of the ROW loop has been absorbed into the flattened loop allowing the nested ROW and COL loops to execute in eight clock cycles. The only overhead remaining is caused by the main loop.
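To make the effect of loop flattening concrete, the following hand-written sketch shows roughly what the flattened ROW/COL loop of Example 4-14 behaves like. It is illustrative only; the synthesis tool performs this flattening internally, and the function and label names here are assumptions, not tool output.

#include "accum.h"
#define MAX 100000
// Hand-flattened equivalent of pipelining the ROW loop with II=1
// (illustrative only; the tool performs this flattening internally)
void accumulate_row_flat(int din[2][4], int &dout){
  int acc = 0;                      // main loop overhead stays outside the flattened loop
 ROW_COL:for(int k=0;k<8;k++){      // 2 rows x 4 columns = 8 iterations
    int i = k >> 2;                 // row index
    int j = k & 3;                  // column index
    if(j==0 && acc>MAX)             // ROW loop saturation now executes inside the flattened loop
      acc = MAX;
    acc += din[i][j];
  }
  dout = acc;                       // main loop overhead stays outside the flattened loop
}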

Pipelined main Loop with II=1

Similar to pipelining the ROW loop, pipelining the main loop causes the main, ROW, and COL loops to be flattened into a single loop. This has the effect of moving the loop iterator initialization and the write of the output “dout” into the ROW_COL loop and executing them conditionally, as shown in the state diagram of Figure 4-25. The net result is to increase the design performance at the expense of making the control logic more complicated. Figure 4-26 shows the schedule with the main loop pipelined. The two-cycle overhead of the main loop has been flattened along with the ROW and COL loops, allowing the design to achieve maximum performance.
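A similar hand-written sketch, again illustrative only and using assumed names, shows how the initialization and the output write become conditional operations inside the flattened loop when the main loop is pipelined:

#include "accum.h"
#define MAX 100000
// Illustrative sketch of pipelining the main loop with II=1: the accumulator
// reset and the write of dout now execute conditionally inside the single
// flattened loop (the tool performs this transformation internally)
void accumulate_main_flat(int din[2][4], int &dout){
  int acc = 0;
 MAIN_ROW_COL:for(int k=0;k<8;k++){ // 2 rows x 4 columns = 8 iterations
    int i = k >> 2;                 // row index
    int j = k & 3;                  // column index
    if(k==0)                        // initialization moved inside the loop, executed conditionally
      acc = 0;
    if(j==0 && acc>MAX)             // ROW loop saturation, also flattened
      acc = MAX;
    acc += din[i][j];
    if(k==7)                        // output written conditionally on the last iteration
      dout = acc;
  }
}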

Figure 4-25. State Diagram of main Loop with II=1



Figure 4-26. Schedule of Main Loop with II=1

Unrolling Nested Loops

Loop unrolling can be applied to nested loops to increase the design performance, often at the expense of larger area. Because of this, designers must be methodical in choosing how much to unroll a loop. For nested loops with a large number of iterations it is more commonplace to leave the outer loop(s) rolled and partially or fully unroll the inner loop when trying to increase design performance. This is also usually done in combination with loop pipelining.

NOTE: In general it is better to pipeline loops before applying loop unrolling. This is because loop pipelining often gives a significant boost in performance at a smaller cost in terms of area.
Loop unrolling, on the other hand, usually has a greater impact on area when the loop body contains a large number of operations. This is because unrolling replicates the loop body, leading to a larger number of resources being scheduled in parallel.
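For reference, partially unrolling an inner loop can be pictured as follows. This is a hand-written sketch based on Example 4-14 with the saturation logic omitted for brevity; in practice the same effect is normally achieved with an unrolling directive rather than by editing the source.

#include "accum.h"
// Illustrative sketch only: the COL loop is partially unrolled by a factor
// of two, so each loop iteration performs two accumulations
void accumulate_partial_unroll(int din[2][4], int &dout){
  int acc = 0;
 ROW:for(int i=0;i<2;i++){
  COL:for(int j=0;j<4;j+=2){
      acc += din[i][j];
      acc += din[i][j+1];
    }
  }
  dout = acc;
}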

Unrolling the Innermost Loop

Example 4-15 shows a C++ design that uses two nested loops to separately accumulate the rows of a two-dimensional array. This example is synthesized with the COL loop fully unrolled and the ROW loop pipelined with II=1.

Example 4-15. Unrolling the Inner Loop

#include "accum.h"
void accumulate(int din[2][4], int dout[2]){
  int acc[2];
 ROW:for(int i=0;i<2;i++){
     acc[i] = 0;
  COL:for(int j=0;j<4;j++){
      acc[i] += din[i][j];
    }
    dout[i] = acc[i];
  }
}

Fully unrolling the COL loop has the same effect as manually replicating the COL loop body, shown in Example 4-16.

Example 4-16. Unrolling the Inner Loop Manually

#include "accum.h"
void accumulate(int din[2][4], int dout[2]){
  int acc;
 ROW:for(int i=0;i<2;i++){
    acc=0;
    acc += din[i][0];
    acc += din[i][1];
    acc += din[i][2];
    acc += din[i][3];
    dout[i] = acc;
  }
}

Example 4-16, which shows the effects of duplicating the inner loop body, is transformed during scheduling into something that more closely resembles the code shown in Example 4-17.

Example 4-17. Unrolling the Inner Loop Transformation

#include "accum.h"
void accumulate(int din[2][4], int dout[2]){
 ROW:for(int i=0;i<2;i++){
    dout[i] = din[i][0]+din[i][1]+din[i][2]+din[i][3];
  }
}

Example 4-17 shows that accumulating four values at a time requires three adders. Assuming that there is sufficient time to schedule the three adders in the same clock cycle, the design schedule looks like that shown in Figure 4-27. Each iteration of the ROW loop executes in one clock cycle, while there is still some overhead caused by not pipelining the main loop.

Figure 4-27. Schedule with Inner Loop Fully Unrolled ROW Loop with II=1

The hardware resulting from synthesizing Example 4-15 is shown in Figure 4-28. High-level synthesis automatically builds a balanced adder tree when unrolling accumulators inside a loop. There are some situations where the tree balancing does not happen automatically because the accumulation is conditional. This is discussed later.
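The difference between a chained and a balanced grouping of the three additions can be written out explicitly. The function names below are illustrative; both compute the same row sum.

// Chained grouping: three adders in series on the critical path
int row_sum_chained(const int d[4]){
  return ((d[0] + d[1]) + d[2]) + d[3];
}

// Balanced grouping: only two adder levels on the critical path, which is
// what high-level synthesis builds when it unrolls the accumulation
int row_sum_balanced(const int d[4]){
  return (d[0] + d[1]) + (d[2] + d[3]);
}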

Figure 4-28. Hardware with Inner Loop Fully Unrolled

Rampup/Rampdown of Pipelined Nested Loops

Increasing the clock frequency when synthesizing Example 4-15 at some point requires that the adder tree shown in Figure 4-28 be scheduled over multiple clock cycles, or c-steps. Figure 4-29 shows the design schedule where the first two adders are scheduled together in the same c-step, with the second adder stage scheduled in the next c-step. Two pipeline stages are created when the ROW loop is pipelined with II=1, and the design latency and throughput are affected by pipeline rampup and rampdown, initially discussed in “Classic RISC Pipelining” on page 41. For loops with a large number of iterations the effect of rampup/rampdown may be negligible, and allowing the pipeline to ramp down has the added benefit of allowing all data to be “flushed” from the pipeline stages. In this example, however, the cost of rampup/rampdown is significant compared to the number of iterations of the ROW loop.

Figure 4-29. Schedule of Ramp-up/down with Inner Loop Fully Unrolled

Figure 4-30 shows the hardware generated for the schedule shown in Figure 4-29. The adder tree has been separated into two pipeline stages.
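As a rough illustration of the two-stage split (the actual register placement is decided by the scheduler, and the function name is assumed):

// Rough sketch: at a higher clock frequency the balanced adder tree is
// scheduled over two c-steps, creating two pipeline stages
int row_sum_two_stage(const int d[4]){
  // Stage 1 (first c-step): two partial sums computed in parallel
  int p0 = d[0] + d[1];
  int p1 = d[2] + d[3];
  // Stage 2 (next c-step): final sum; a pipeline register separates the stages
  return p0 + p1;
}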

Figure 4-30. Hardware of Ramp-up/down with Inner Loop Fully Unrolled

Rampup Only of Nested Loops with Pipelined Main Loop

A possible solution for increasing the performance of designs that have both rampup and rampdown of the pipeline is to pipeline the “main” loop with II=1. When this is done the pipeline only ramps up and then runs continuously, removing the throughput cost of pipeline rampdown. This is shown in Figure 4-31.

Figure 4-31. Rampup of Nested Loops with Main Loop II=1


CAUTION: There are side effects associated with pipelining the main loop when the design has rolled loops. If IO is mapped to a handshaking interface and is accessed inside the pipelined loop, it can cause the pipeline to stall. This is covered in “Conditional IO” on page 90.

Unrolling the Outer Loop

The previous section illustrated how unrolling the innermost loop replicates the loop body, resulting in higher performance. The core architectural feature resulting from unrolling the innermost loop was a balanced adder tree (Figure 4-28). If the inner loop is left rolled and the outer loop is unrolled, the inner loop is replicated as many times as the outer loop is unrolled. Example 4-18 shows the effect of manually unrolling the outer loop: there are now two copies of the inner loop, COL_0 and COL_1.

Example 4-18. Manually Unrolling the Outer Loop

#include "accum.h"
void accumulate(int din[2][4], int dout[2]){
  int acc[2];
 
  acc[0] = 0;
 COL_0:for(int j=0;j<4;j++){
    acc[0] += din[0][j];
  }
  dout[0] = acc[0];
  acc[1] = 0;
 COL_1:for(int j=0;j<4;j++){
    acc[1] += din[1][j];
  }
  dout[1] = acc[1];
}

When possible, high-level synthesis automatically merges all of the replicated loops into a single loop, leading to a number of accumulators running in parallel. Example 4-19 shows the effects of manually merging the two COL loops.

Example 4-19. Manual Merging

#include "accum.h"
void accumulate(int din[2][4], int dout[2]){
  int acc[2];
 
  acc[0] = 0;
  acc[1] = 0;
 COL_0_1:for(int j=0;j<4;j++){
    acc[0] += din[0][j];
    acc[1] += din[1][j];
  }
  dout[0] = acc[0];
  dout[1] = acc[1];
}

Figure 4-32 shows the schedule when the ROW loop is fully unrolled and all copies of the COL loop are merged.

Figure 4-32. Unrolling the Outer Loop with Loop Merging

Figure 4-33 shows the synthesized hardware resulting from unrolling the outer loop, which has the effect of creating two accumulators running in parallel.

Figure 4-33. Hardware of Unrolling the Outer Loop

Reversing the Loop Order

The previous section illustrated how unrolling the outer loop causes the inner loop to be replicated and merged automatically during synthesis. However, there are situations that prevent automatic merging, and this leads to sub-optimal performance. Example 4-20 shows the accumulator design used in the previous section, modified to conditionally assign the index of the “acc” array. This conditional index assignment breaks automatic loop merging.

Example 4-20. Conditional Index Breaks Loop Merging

#include "accum.h"
void accumulate(int din[2][4], int dout[2], bool flag){
  int acc[2];
  int idx;
 ROW:for(int i=0;i<2;i++){
    idx = flag ? i: 1-i; 
    acc[idx] = 0;
  COL:for(int j=0;j<4;j++){
      acc[idx] += din[i][j];
    }
    dout[i] = acc[i]; 
  }
}

Not merging the two copies of the COL loop that result from unrolling the ROW loop causes the loops to be scheduled sequentially (See “Sequential Loops” on page 69). Pipelining the main loop with II=1 causes the two copies of the COL loop, COL_0 and COL_1, to be flattened into the main loop, but they are still executed sequentially, as shown in Figure 4-34.

Figure 4-34. Schedule with Conditional Index and ROW Loop Unrolled

One possible solution to achieve the desired behavior of two accumulators running in parallel is to reverse the order of the ROW and COL loops. However, this must be done carefully since it usually requires that the outer loop body be moved into the inner loop and executed conditionally. Example 4-21 shows how to manually reverse the loop order.

Example 4-21. Reversing the Loop Order

   1  #include "accum.h"
   2  void accumulate(int din[2][4], int dout[2], bool flag){
   3    int acc[2];
   4    int idx;
   5 
   6  COL:for(int j=0;j<4;j++){
   7    ROW:for(int i=0;i<2;i++){
   8        idx = flag ? i: 1-i; 
   9        if(j==0)
   10         acc[idx] = 0;
   11       acc[idx] += din[i][j];
   12       if(j==3)
   13         dout[i] = acc[i]; 
   14     }
   15   }
   16 }

The following code changes were made in Example 4-21.

•     Lines 6 and 7 - Reversed the order of the ROW and COL loops
•     Line 8 - Moved the index computation into the inner loop body
•     Lines 9 and 10 - Moved the clearing of the accumulators into the inner loop body and made it conditional so that each accumulator is only cleared once at the beginning
•     Lines 12 and 13 - Moved the writing of the output into the inner loop body and made the writes conditional so that the output is only written on the final iteration of COL

Sequential Loops

It is not uncommon to have multiple consecutive loops in a C++ design. Although these loops execute sequentially in the C++ simulation, HLS can be directed to automatically merge them and execute them in parallel in hardware. However, there are many cases where the C++ code is written in such a way as to make automatic loop merging impossible. In these cases either the C++ code must be rewritten to manually merge the loops if better performance is required, or explicit hierarchy should be used (See “Hierarchical Design” on page 191).

It is important for designers to understand the behavior of the hardware when loop merging does and does not take place so there are no unexpected results.

Simple Independent Sequential Loops

Example 4-22 shows the case where there are two sequential loops that are used to separately accumulate two four-element arrays.

Example 4-22. Independent Sequential Loops

#include "accum.h"
void accumulate(int din0[4], int din1[4], int &dout0, int &dout1){
  int acc0=0;
  int acc1=0;
 ACCUM0:for(int i=0;i<4;i++){
    acc0 += din0[i];
  }
 ACCUM1:for(int i=0;i<4;i++){
    acc1 += din1[i];
  }
  dout0 = acc0;
  dout1 = acc1;
}

High-level synthesis can automatically merge these loops because there are no dependencies between the loops and the indexing of the arrays is based solely on the loop iterators. With loops left rolled and automatically merged, and the main loop pipelined with II=1, the resulting schedule looks like that shown in Figure 4-35.

Figure 4-35. Schedule of Merged Sequential Loops

The schedule shown above indicates that the loop iterations of the two ACCUM loops can run at the same time, resulting in a design that has two accumulators and runs in four clock cycles (Figure 4-36). If this level of performance, and the accompanying increase in area, is not required, automatic loop merging can be disabled during synthesis, allowing the loops to execute sequentially. This is discussed in the next section.

Figure 4-36. Hardware of Merged Sequential Loops
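For reference, the hardware in Figure 4-36 corresponds to what a hand-merged version of Example 4-22 would look like. The sketch below is illustrative; the tool performs this merge automatically in this case.

#include "accum.h"
// Hand-merged equivalent of Example 4-22: one loop drives two accumulators
// that run in parallel in the generated hardware
void accumulate_merged(int din0[4], int din1[4], int &dout0, int &dout1){
  int acc0 = 0;
  int acc1 = 0;
 ACCUM0_1:for(int i=0;i<4;i++){
    acc0 += din0[i];
    acc1 += din1[i];
  }
  dout0 = acc0;
  dout1 = acc1;
}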

Effects of Unmerged Sequential Loops

In some instances sequential loops are not automatically merged. This can occur intentionally, because the design does not require the extra performance (which usually comes at the cost of higher area), or because there are dependencies between the loops that break the loop merging optimization. Other operations, such as conditional index assignment when reading or writing an array, can also prevent loop merging. In either case the result is a design with longer latency and lower throughput.

Consider the following design shown in Example 4-23. In this example the accumulated result from the ACCUM0 loop is used as the starting value for the ACCUM1 loop. These loops are not automatically merged since the ACCUM0 loop must finish before the ACCUM1 loop can start.

Example 4-23. Unmerged Sequential Loops

#include "accum.h"
void accumulate(int din0[4], int din1[4], int &dout0, int &dout1){
  int acc0=0;
  int acc1=0;

 ACCUM0:for(int i=0;i<4;i++){
    acc0 += din0[i];
  }
  acc1 = acc0;
 ACCUM1:for(int i=0;i<4;i++){
    acc1 += din1[i];
  }
  dout0 = acc0;
  dout1 = acc1;
}

Figure 4-37 shows the schedule when the main loop of Example 4-23 is pipelined with II=1. It also illustrates the effect of pipelining the main loop when there are unmerged sequential loops in the design. Pipelining the main loop causes all loops in the design to be flattened, which in turn causes the last iteration of the ACCUM0 loop to overlap with the first iteration of the ACCUM1 loop. Although this improves the design performance slightly, it requires two adders to implement the hardware. If performance is not an issue it is better to pipeline the ACCUM0 and ACCUM1 loops individually. This allows the operations scheduled in each loop to be shared, reducing the area. However, pipelining the loops individually can impact the performance since each loop must then ramp up and ramp down separately.

Figure 4-37. Schedule of Unmerged Sequential Loops with Main II=1

Figure 4-38 shows the schedule when the ACCUM0 and ACCUM1 loops of Example 4-23 are pipelined with II=1 instead of pipelining the main loop. In this case there is no overlap between the loops and a single adder can be used to implement the hardware. However, there is a two-cycle performance penalty due to the un-pipelined main loop (C1 Main and C2 Main).

Figure 4-38. Schedule of Unmerged Sequential Loops with ACCUM(s) II=1

Manual Merging of Sequential Loops

It is up to the designer to manually merge sequential loops in situations where HLS does not do it automatically and merged loops are the desired behavior. This usually means rewriting the C++ code. Example 4-24 shows the manual rewrite of the code in Example 4-23 to achieve the best possible performance.

Example 4-24. Manually Merged Sequential Loops with Main II=1


#include "accum.h"
void accumulate(int din0[4], int din1[4], int &dout0, int &dout1){
  int acc0=0;
  int acc1=0;
  int tmp;
 ACCUM0_1:for(int i=0;i<4;i++){
    tmp = din0[i];
    acc0 += tmp;
    acc1 += din1[i]+tmp;
  }
  dout0 = acc0;
  dout1 = acc1;
}

The example shown above manually merges the sequential loops so that the design runs in four clock cycles when the main loop is pipelined with II=1. However, this design is larger than the previous implementations because it requires three adders, as shown in the schedule in Figure 4-39.

Figure 4-39. Schedule of Manual Merged Sequential Loops with Main II=1

Next Week: Pipeline Feedback

"The initiation interval can be set anywhere from a synthesis tool dependent maximum down to an II=1 on any feed-forward design. However, a design with feedback limits the initiation interval to be no less than the delay of the feedback path. There are three types of feedback, data dependent, control dependent, and inter-block feedback."

