Design Articles

Fundamentals of High Level Synthesis—Part 2

This article is excerpted from Chapter 4: Fundamentals of High Level Synthesis from the newly published book High Level Synthesis Blue Book by Michael Fingerhoff (Copyright 2010 by Mentor Graphics Corporation). New installments of this chapter will appear every Monday for the next four weeks. If you can't wait--or want to learn more--the book is available at

By Michael Fingeroff, Mentor Graphics


HLS book

One of the most important features of HLS for tuning design performance is Loop Unrolling. However, it is necessary first to discuss what constitutes a “loop” in C++. Loops are the primary mechanism for applying high level synthesis constraints as well as moving data, or IO, into and out of an algorithm. The style in which loops are written can have a significant impact on the quality of results of the generated hardware. In order to talk about how to write loops it’s helpful to introduce a few definitions:

•     Interface synthesis - the process of mapping top-level C++ variables to resources that implement an interface protocol (wire, handshake, memory).
•     Loop iterations - the number of times the loop runs before it exits.
•     Loop iterator - the variable used to compute the loop iteration.
•     Loop body - the code between the start and the end of the loop.
•     Loop unrolling - the number of times to copy the loop body.
•     Loop pipelining - how often to start the next iteration of the loop.

What’s in a Loop?

In HLS a design always has one loop which corresponds to the top-level function call. This is known as the “main loop” (See line 1 of Example 4-4).

Example 4-4. The Main Loop

1  void top(int din, int& dout){
2    dout = din;
3  }

The “main loop” is a continuously running loop, which means that it runs for an infinite number of iterations. The analog to this can be seen from the equivalent Verilog module implementation that was shown in the top-level interface section. Once that Verilog module is reset it runs forever as long as the clock is supplied.

There are three ways to specify a loop in C++; using the “for” loop, “while” loop, and “do-while” loop. The syntax is as follows:

“for” Loop


LABEL: for( initialization; test-condition; increment ) {
    statement-list or loop body;

The “for” construct is a general looping mechanism consisting of 4 parts:

1.   initialization - which consists of zero or more comma-delimited variable initialization statements
2.   test-condition - which is evaluated to determine if the execution of the for loop continues
3.   increment - which consists of zero or more comma-delimited statements that increment variables
4.   statement-list - which consists of zero or more statements that execute each time the loop is executed.

Example 4-5. “for” Loop

#include “simple_for_loop.h”
void simple_for_loop(int din[4], int dout[4]){
 FOR_LOOP:for(int i=0;i<4;i++){
    dout[i] = din[i];

The example “for” loop shown in Example 4-5 copies four 32-bit values from din to dout. The for loop has initialization “int i=0”, test condition “i<4”, and increment “i++”.

“while” Loop

The “while” keyword is used as a looping construct that executes the loop body as long as condition is tested as true. If the condition starts off as false, the loop body is never executed. (You can use a do loop to guarantee that the loop body executes at least once.)


LABEL: while(test-condition) {
    statement-list or loop body;

Example 4-6. “while” Loop

#include “simple_while_loop.h”
void simple_while_loop(int din[4], int dout[4]){
  int i=0;
    dout[i] = din[i];

The “while” loop shown in Example 4-6 has the same functionality of the previous “for” loop example.

“do” Loop

The “do” keyword is used as a looping construct that executes the loop body until the condition is tested as false. The loop body always executes at least once.


LABEL: do{
    statement-list or loop body;
 } while(test-condition);

Example 4-7. “do” Loop

#include “simple_do_loop.h”
void simple_do_loop(int din[4], int dout[4]){
  int i=0;
    dout[i] = din[i];

The “do” loop shown in Example 4-7 has the same functionality as the previous “for” and “while” loop examples.

Rolled Loops

NOTE: If a loop is left “rolled”, each iteration of the loop takes at least one clock cycle to execute in hardware. This is because there is an implied “wait until clock” for the loop body

Consider the following C++ example that uses a “for” loop to accumulate four 32-bit integers from an array:

Example 4-8. C++ Accumulate Using Loops

1  void accumulate4(int din[4], int &dout){
2    int acc=0;
3  ACCUM:for(int i=0;i<4;i++){
4      acc += din[i];
5    }
6    dout = acc;
7  } 

Design Constraints
Main loop pipelined with II=1
All loops left rolled

Although the loop is left rolled notice that the design has been pipelined with an II=1. This was done intentionally in order to ignore the effects of the extra clock cycle required for allowing the write of “dout” to complete, as it was discussed in “Loop Pipelining” on page 41. The effects of pipelining loops is covered in more detail in later sections. Figure 4-11 shows the schedule of the loop iterations for Example 4-8.

Figure 4-11. Schedule for Accumulate Using Loops

Figure 4-11 shows that each call to the “accumulate” function requires four clock cycles to accumulate the four 32-bit values in Example 4-8. This is because the loop has been left rolled and there is an implied “wait until clock” at the end of the loop body. The synthesized hardware would have the approximate structure shown in Figure 4-12.

Figure 4-12. Hardware Implementation - Accumulate Using Loops

Figure 4-12 has a structure similar to what one might expect from a hand-code RTL design. However, one important feature to note is that the control logic for this implementation is three bits wide. The reason for this is that the loop exit condition is “i<4”. This means that this loop only exits when “i>=4”, which requires at least three bits.

NOTE: The number of bits required for evaluating the loop exit condition is usually one bit larger than expected. This is because the loop iteration must increment before exiting.

Loop Unrolling

Loop unrolling is the primary mechanism to add parallelism into a design. This is done by automatically scheduling multiple loop iterations in parallel, when possible. The amount of parallelism is controlled by how many loop iterations are run in parallel. This is different than loop pipelining, which allows loop iterations to be started every II clock cycles. Loop unrolling can theoretically execute all loop iterations within a single clock cycle as long as there are no dependencies between successive iterations.

Partial Loop Unrolling

If we take Example 4-8 and unroll the ACCUM loop by a factor of two, this has the equivalent effect of manually duplicating the loop body two times and running the ACCUM loop for half as many iterations. Example 4-9 illustrates the effects of loop unrolling by showing the ACCUM loop of Example 4-8 manually unrolled two times.

Example 4-9. Manual Loop Unrolling - Unroll by 2

1  void accumulate(int din[4], int &dout){
2    int acc=0;
3  ACCUM:for(int i=0;i<4;i+=2){
4      acc += din[i];
5      acc += din[i+1];
6    }
7    dout = acc;
8  }

The details of Example 4-9 are:

•     line 3 increments the ACCUM loop by two, which means that the “partially unrolled” loop now has two iterations.
•     Lines 4 and 5 have duplicated the loop body two times, which shows that two accumulations are performed each iteration. It should be noted that the accumulate in Line 5 is dependent on the accumulate on line 4. For now it is assumed that there is still sufficient time to schedule both in the same clock cycle. Dependencies between loop iterations are discussed later.

Figure 4-13 shows the schedule of the loop iterations for Example 4-8 on page 47 when the ACCUM loop is unrolled by two. All four values are now accumulated in only two clock cycles.

Figure 4-13. Schedule for Accumulate Unroll by 2

Figure 4-14 shows the hardware implementation when unrolling by two. It can be seen that this design requires twice as many resources (adders) as the “rolled” version.

Figure 4-14. Hardware Implementation - Accumulate Unroll by 2

Fully Unrolled Loops

Taking Example 4-8 on page 47 and “fully” unrolling the ACCUM loop dissolves the loop and allows all iterations to be scheduled in the same clock cycle (Assuming that there is sufficient time to account for dependencies between iterations). The manual equivalent C++ of doing this is shown in Example 4-10.

Example 4-10. Manual Loop Unrolling - Fully Unrolled

void accumulate(int din[4], int &dout){
    int acc=0;

    acc += din[0];
    acc += din[1];
    acc += din[2];
    acc += din[3];

    dout = acc;

Figure 4-15 shows the schedule of the fully unrolled ACCUM loop. All four values are now accumulated in a single clock cycle.

Figure 4-15. Schedule for Accumulate - Fully Unrolled

Figure 4-16 shows the approximate hardware when fully unrolling the ACCUM loop.

Figure 4-16. Hardware Implementation - Accumulate with Fully Unrolled Loop

Dependencies Between Loop Iterations

NOTE: Unrolling a loop does not necessarily guarantee that the loop iterations are scheduled in the same c-step. Dependencies between iterations can limit parallelism.

The previous examples have assumed that there is sufficient time to ignore the effects of any dependencies between loop iterations. Thus Figure 4-15 shows all four iterations scheduled in the same clock cycle, but it does not show the dependencies that exist between iterations. A more accurate depiction of the schedule that includes the dependencies and component delays is shown in Figure 4-17. If the adders in Figure 4-17 were sufficiently slow it would be likely that second stage of the adder tree would be scheduled in the next clock cycle, increasing the design latency. However, if the design is pipelined with II=1 it is still possible to achieve a throughput of accumulating four values per clock cycle. Thus some dependencies between loop iterations do not limit design performance. However in many cases the dependencies between iterations limit performance or prevent pipelining. This is covered in detail in “Data Feedback” on page 73.

Figure 4-17. Schedule Dependencies

Loops with Constant Bounds

When writing loops for HLS it is important, when possible, to express them such that there is:

1.   A constant initialization of the loop iterator
2.   A test condition of the loop iterator against a constant value
3.   A constant increment of the loop iterator

Writing the loop in this fashion allows HLS to optimize the design by reducing the bit widths of control and data path signals that are based on the loop iterator. This is because the three conditions listed above are sufficient for determining the maximum number of loop iterations. This is desirable to be able to get accurate information about latency and throughput of a design.

The four cycle accumulator in Example 4-8 on page 47 is a good example of writing loops with constant bounds. The corresponding hardware implementation shown in Figure 4-12 on page 48 shows that the control logic is optimally reduced to three bits. The main point to take away from this example is that even though the loop iterator “i” was declared as a 32-bit integer, HLS is able to reduce the bit widths to the fewest possible bits because the loop was written with constant bounds.

Loops with Conditional Bounds

The previous section showed that optimal hardware can be inferred if a loop is written with unconditional bounds. However, it is often the case that an algorithm or design requires that a loop terminate early based on some variable that has been defined outside of the loop, or on the design interface. This is a perfectly reasonable thing to do, but the way this is written in the C++ code can have a dramatic impact on the quality of results as well as accurate reporting of latency and throughput.

The accumulator design used in Example 4-8 can be modified to illustrate the impact in quality-of-results when using a loop with conditional bounds. In order to make the accumulator more programmable the code is modified so that the accumulator can accumulate anywhere from one to four 32-bit values.

Example 4-11. Conditional Accumulate

1  #include “accum.h”
2  #include <ac_int.h>
3  void accumulate(int din[4], int &dout, unsigned  int ctrl){
4    int acc=0;
5  ACCUM:for(int i=0;i<ctrl;i++){
6      acc += din[i];
7    }
8    dout = acc;
9  }

The modified accumulator design shown above now uses the interface variable “ctrl” on line 5 to select the number of loop iterations to be one through four. Synthesizing this design reveals that there are several inefficiencies with the resulting hardware.

CAUTION: Having a variable as the loop upper or lower bound often results in the loop counter hardware being larger than needed

CAUTION: Having a variable as the loop upper bound requires one extra clock cycle to test the loop condition

CAUTION: Having an unconstrained bit width on the loop exit condition results in control logic larger than needed
Lets first examine the hardware that would be synthesized from the conditional accumulator shown in Example 4-11:

Figure 4-18. Loop With Conditional Bounds

The resulting hardware synthesized from the C++ accumulator with conditional loop bounds results in a 32-bit loop counter and 33-bit logic for the loop exit condition (Figure 4-18). The reason for this is that the interface variable “ctrl” is a 32 bit integer. Because “ctrl” is on the design interface, HLS has no way of knowing, or more importantly proving, that it only should ever range from one to four.

NOTE: This is an important lesson about HLS in that it only automatically reduces bit-widths where it can symbolically prove that it can be done without changing the functionality between the C++ code and the generated RTL.

In this example, the C++ specifies a 32-bit interface variable which requires 33-bit control logic to be functionally equivalent. The solution to the problem shown above requires two minor C++ code changes, and can be split into two parts; fixing the loop counter, and fixing the loop exit condition.

Optimizing the Loop Counter

In order for HLS to reduce the bit width of the loop counter the loop upper bound should be set to a constant. However, since the execution of each loop iteration is determined by the variable, “ctrl”, we need to add a mechanism for terminating the loop early. This is done by using a conditional break in the loop body shown in Example 4-12.

Example 4-12. Bounded Loop with Conditional Break
   1  #include “accum.h”
   2  #include <ac_int.h>
   3  void accumulate(int din[4], int &dout, int ctrl){
   4    int acc=0;
   5  ACCUM:for(int i=0;i<4;i++){
   6      acc += din[i];
   7      if(i>=ctrl-1)
   8        break;
   9    }
   10   dout = acc;
   11 }

The conditional break is placed at the end of the loop because it is assumed that there is at least one loop iteration. Having the conditional break at the end of the loop should give the best quality of results in general. However, if “ctrl” can be zero, meaning that the loop can have zero iterations, the break must be placed at the beginning of the loop body. Figure 4-19 shows the resulting hardware from Example 4-12.

Figure 4-19. Bounded Loop With Conditional Break

The code transformation has the effect of reducing the loop counter to three bits by fixing the upper loop bound to a constant. Unfortunately, the code change has actually made the design slightly larger. Putting the conditional break at the end of the loop has created a 33-bit subtractor to compute “ctrl-1” and a 34-bit subtractor to compute the “>=” operation. This is in part because “ctrl” is 32-bits and cannot be automatically reduced since it is on the design interface. The control logic can be further optimized.

NOTE: In general making the loop bounds constant produces better hardware. Conditional breaks can be used inside of the loop to give the same functionality as a variable loop bound.

Optimizing the Loop Control

There are two problems with the control logic for the loop exit condition of Example 4-12:

•     Use of a 32-bit integer on the design interface
•     Exit condition test requires a subtractor to compare against ctrl-1

HLS is not able to reduce the bit-widths on the top-level design interface for this design since it cannot prove that “ctrl” is always between one and four. In this case the designer must constrain the bit-width of “ctrl” to the desired number of bits. Native C++ data types do not give designers the ability to specify arbitrary bit widths on a variable so bit-accurate data types are required. A better way to write the code to optimize the loop control is shown in Example 4-13.

Example 4-13. Optimized Loop Control
   1  #include “accum.h”
   2  #include <ac_int.h>
   3  void accumulate(int din[4], int &dout, ac_int<3,false> ctrl){
   4    int acc=0;
   5    int i_old=0;
   6  ACCUM:for(int i=0;i<4;i++){
   7      acc += din[i];
   8      if(i_old==ctrl)
   9        break;
   10     i_old = i;
   11   }
   12   dout = acc;
   13 }

The following code changes were made to optimize the loop control logic:

1.   Line 3 - “ctrl” was constrained to three bits reducing the comparison logic to three bits
2.   Line 10 - “i_old” stores the previous value of the loop iterator “i”
3.   Line 8 - The exit condition test is made on the previous value of “i” eliminating the need for a subtractor.

The resulting hardware from Example 4-13 is shown in Figure 4-20.

Figure 4-20. Optimized Loop Control

NOTE: Interface variables should always be constrained to the minimum number of bits, especially when used as a loop control variable.

Next Week: Nested and pipelines loops

"Nested loops and the effects of pipelining nested loops is often one of the most misunderstood concepts of high-level C++ synthesis. Understanding the resulting hardware behavior from synthesizing non-pipelined and pipelined nested loops allows designers to more easily meet performance and area requirements."

Bookmark and Share

Insert your comment

Author Name(required):

Author Web Site:

Author email address(required):


Please Introduce Secure Code: