Fundamentals of High Level Synthesis—Part 1

This article is excerpted from Chapter 4: Fundamentals of High Level Synthesis from the newly published book High Level Synthesis Blue Book by Michael Fingeroff (Copyright 2010 by Mentor Graphics Corporation). New installments of this chapter will appear every Monday for the next four weeks. If you can't wait--or want to learn more--the book is available at Amazon.com.

By Michael Fingeroff, Mentor Graphics

Introduction


A common misconception is that synthesizing hardware from C++ gives users the freedom to express their algorithms in any style of C++ coding they desire. When designing using high-level C++ synthesis, it is important to remember that we are still describing hardware using C++, and a “poor” description can lead to a sub-optimal RTL implementation. It is the responsibility of the user to code their C++ to not only account for the underlying memory architecture of a design, but also to adhere to the recommended coding style for C++ synthesis. Because of this it is important to have a solid understanding of what high level synthesis really does. This chapter attempts to cover the basics of high level synthesis and to show what designers can expect from a given coding style. Where appropriate, the code examples are accompanied by hardware diagrams so that RTL designers can relate C++ synthesis to concepts that are familiar to them.

The Top-level Design Module

Similar to RTL design, HLS requires that users specify where the “top” of their design is. This is where the design interfaces with the outside world, and it consists of port definitions, directions, and bit widths (or, in the case of C++, data types). Since we are still designing hardware, although using C++, it is helpful to see how this might relate to the world of RTL design. Consider the following simple Verilog RTL design that describes a D-type flip-flop with asynchronous reset.

Example 4-1. Simple Verilog Design Module
module top(clk, arst, din, dout);

input         clk;
input         arst;
input  [31:0] din;
output [31:0] dout;
reg    [31:0] dout;

always @(posedge clk or posedge arst)
begin
  if (arst == 1'b1)
    dout <= 32'b0;
  else
    dout <= din;
end

endmodule

The Verilog module shown in Example 4-1 contains three input ports and a single output port. The inputs are the clock, the reset, and the 32-bit input data; the output is the 32-bit output data. The port directions and widths are explicitly defined in the Verilog source.

In order for HLS to determine the top-level design from the C++, the user must either specify a pragma in the source code or set a user constraint in the synthesis tool.

The equivalent C++ description of the Verilog design above looks like this:

Example 4-2. Setting the Top-level Design
void top(int din, int& dout){
  dout = din;
}

Even this simple example illustrates how HLS can simplify the design process: the C++ description is very compact. Looking at Example 4-2, it is important to understand that several things are implied in the code.
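When the pragma route is used, the marker is placed directly above the top-level function. A minimal sketch, assuming Catapult-style pragma syntax (the exact pragma name is tool-specific, and a standard C++ compiler simply ignores pragmas it does not recognize):

```cpp
#include <cassert>  // only needed for the quick check below

// Tool-specific pragma marking this function as the top of the design
// hierarchy (Catapult-style syntax assumed; other tools use their own names).
#pragma hls_design top
void top(int din, int& dout) {
  dout = din;
}
```

Because the pragma is inert outside the HLS tool, the same source can still be compiled and verified with an ordinary C++ compiler.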

Registered Outputs
High-level synthesis builds synchronous designs by default. This means that all outputs of the top-level design are registered to guarantee that timing is met when connecting to another design. There are mechanisms for building smaller combinational blocks (automated component flows), but in general designs are synchronous.
Control Ports
The C++ code has no concept of timing, so there are no clocks, enables, or resets described in the source. Instead these signals are added by the synthesis tool. Details such as polarity and type of reset are controlled by setting design constraints.

Port Width
For the simple case, meaning minimal interface constraints in synthesis, the bit widths of the top-level ports, excluding clock and reset, are implied by the data type. In the design example shown above the data type is “int”, which implies 32 bits. Designers can describe any arbitrary bit width using bit-accurate data types.

Port Direction
The port direction is implied by how an interface variable is used in the C++ code.

Input ports
An input port is inferred when an interface variable is only read. In the C++ example shown in Example 4-2 you can see that “din” is only ever read, so it is determined to be an input. If a variable is declared on the interface as “pass by value” it can only be an input. This is covered in more detail later.

Output ports
An output port is inferred in two cases. One is when the top-level function returns a value. The other is when the interface variable is only written in the C++ code. In Example 4-2 it can be seen that “dout” is only ever written. It can also be seen that “dout” is declared as a reference. A variable must be declared as a reference or a pointer in order to infer an output. This is also covered in more detail later.

Inout Ports
Although this design does not contain any inout ports, these are inferred if an interface variable is both read and written in the same design. This requires that the variable is declared as a reference or a pointer.
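The three inference rules above can be summarized in a single hypothetical interface (the function and port names below are illustrative, not taken from the book):

```cpp
#include <cassert>  // only needed for the quick check below

// Sketch of how port direction follows C++ usage:
//   - pass by value, only read            -> input port
//   - reference/pointer, only written     -> output port
//   - reference/pointer, read and written -> inout port
void ports_example(int din,     // input: pass by value, only read
                   int& dout,   // output: reference, only written
                   int& dio) {  // inout: reference, both read and written
  dout = din + 1;
  dio  = dio + din;
}
```

Calling the function in a C++ testbench behaves exactly as the inferred hardware would: `din` is consumed, `dout` is produced, and `dio` is updated in place.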

High-level C++ Synthesis

Although this style guide is not intended to be a tutorial on the intricacies of high-level synthesis optimizations, it is useful to present a brief overview of the basic process of automatically transforming un-timed algorithmic C++ into hardware. Understanding these fundamental concepts goes a long way toward providing a solid foundation for the material covered in later sections.

Similar to the rest of the style guide, these concepts are best illustrated using simple examples consisting of both C++ code and hardware design concepts. Consider the following C++ example that accumulates four integer values shown in Example 4-3.

Example 4-3. Simple C++ Accumulate
#include "accum.h"
void accumulate(int a, int b, int c, int d, int &dout){
  int t1,t2;

  t1 = a + b;
  t2 = t1 + c;
  dout = t2 + d;
}

Data Flow Graph Analysis

The process of high-level synthesis starts by analyzing the data dependencies between the various steps in the algorithm shown above. This analysis leads to a Data Flow Graph (DFG) description shown in Figure 4-1.

Figure 4-1. Data Flow Graph Description

Each node of the DFG represents an operation defined in the C++ code; in this example all operations use the “add” operator. The connections between nodes represent data dependencies and indicate the order of operations. This example shows that t1 must be computed before t2 [1].

Resource Allocation

Once the DFG has been assembled, each operation is mapped onto a hardware resource, a process known as resource allocation. The resource corresponds to a physical implementation of the operator hardware and is annotated with both timing and area information, which is used during scheduling. Any given operator may have multiple hardware resource implementations, each with different area/delay/latency trade-offs. The resources are selected from a technology-specific, pre-characterized library that contains sufficient data points to represent a wide range of bit widths and clock frequencies. Figure 4-2 shows the resource allocation of Figure 4-1. Each operation can potentially be allocated to a different resource, and designers can typically control resource allocation explicitly to insert pipeline registers or limit the number of available resources.

Figure 4-2. Resource Allocation

Scheduling

High-level synthesis adds “time” to the design during the process known as “scheduling”. Scheduling takes the operations described in the DFG and decides when (in which clock cycle) they are performed. This has the effect of adding registers between operations based on a target clock frequency. This is similar to what RTL designers would call pipelining, by which they mean inserting registers to reduce combinational delays. This is not the same as “loop pipelining” which is discussed later.

If we assume that the “add” operation for the DFG of Figure 4-1 takes 3 ns out of a 5 ns clock cycle, the resulting schedule would look something like the schedule shown in Figure 4-3. Each add operation is scheduled in its own clock cycle C1, C2, and C3. Thus registers are inserted automatically between each adder.
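For intuition only, the “does the next operation still fit in this clock cycle?” decision can be sketched as a toy as-soon-as-possible scheduler for a chain of dependent operations. This is an illustration of the concept, not how a production HLS scheduler actually works:

```cpp
#include <cassert>
#include <vector>

// Toy ASAP scheduling of a chain of dependent operations. Each operation has
// a combinational delay; it is placed in the current c-step if it still fits
// within the clock period, otherwise a register is implied and the operation
// moves to the next c-step.
std::vector<int> asap_schedule(const std::vector<double>& op_delay_ns,
                               double clock_ns) {
  std::vector<int> cstep;  // c-step assigned to each operation (1-based)
  int step = 1;
  double used = 0.0;       // combinational delay consumed in this c-step
  for (double d : op_delay_ns) {
    if (used + d > clock_ns) {  // does not fit: register here, next c-step
      ++step;
      used = 0.0;
    }
    used += d;
    cstep.push_back(step);
  }
  return cstep;
}
```

With three 3 ns adds and a 5 ns clock, each add lands in its own c-step, matching Figure 4-3; with a 10 ns clock, two adds would chain in one cycle.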

Figure 4-3. Scheduled Design

A data path state machine (DPFSM) is created to control the scheduled design. The FSM for this example requires four states that correspond to the four clock cycles needed to execute the schedule shown above. In HLS these states are also referred to as control steps or c-steps.

Figure 4-4. Data Path State Diagram

The state diagram of the DPFSM is shown in Figure 4-4 and illustrates that the scheduled design is capable of producing a new output every four clock cycles. Once C4 has completed a new C1 begins.

The resulting hardware that is generated from the schedule shown in Figure 4-3 varies depending on how the design was constrained in terms of resource allocation as well as the amount of loop pipelining used on the design. Loop pipelining is covered in detail next but for now let’s assume that the design is unconstrained, which should allow the hardware to be realized with the minimum number of resources if sharing saves area. In this example that would be the minimum number of adders. The resulting hardware would look something like that shown in Figure 4-5.


Figure 4-5. Hardware Implementation of the Unconstrained Design

The resulting hardware shown in Figure 4-5 uses a single adder to accumulate a, b, c, and d. Note that the data path is 32 bits wide because the variables were declared as integer types.
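The behavior of Figure 4-5 can be modeled as a small C++ sketch: one adder reused across the four FSM states (the state encoding and member names below are illustrative assumptions, not generated output):

```cpp
#include <cassert>  // only needed for the quick check below

// Model of the unconstrained hardware of Figure 4-5: a single 32-bit adder
// shared across four FSM states (c-steps C1..C4).
struct AccumFSM {
  int acc = 0;      // accumulator register
  int state = 1;    // current c-step, C1..C4
  bool done = false;

  // One clock tick: the single adder is reused in every compute state.
  void tick(int a, int b, int c, int d) {
    done = false;
    switch (state) {
      case 1: acc = a + b;   state = 2; break;  // C1: t1 = a + b
      case 2: acc = acc + c; state = 3; break;  // C2: t2 = t1 + c
      case 3: acc = acc + d; state = 4; break;  // C3: dout = t2 + d
      case 4: done = true;   state = 1; break;  // C4: output registered/valid
    }
  }
};
```

Ticking the model four times produces one valid output, mirroring the DPFSM's four-cycle iteration.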

Classic RISC Pipelining

The HLS concept of “Loop Pipelining” is similar to the classic RISC pipeline covered in most introductory computer architecture classes.

The basic five-stage pipeline in a RISC architecture typically consists of Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MA), and Register Write Back (WB) stages.

Figure 4-6. Five Stage RISC Pipeline

Figure 4-6 illustrates how a new instruction can be fetched each clock cycle while the other pipeline stages are gradually activated. The time it takes for all pipeline stages to become active is known as the pipeline “ramp up”. Conversely, the pipeline “ramp down” is the time it takes for all pipeline stages to become inactive after the last instruction has been fetched. The difference between the RISC pipeline and HLS loop pipelining is that the RISC pipeline is designed to fetch and execute every clock cycle. A design that does not need to run every clock cycle under-utilizes the pipeline, and a design that needs to fetch and execute multiple times per clock cycle is not possible. HLS removes these restrictions and allows the pipeline to be custom built to meet the design specification.

Loop Pipelining

Similar to the RISC pipelining example described in Figure 4-6, which allows new instructions to be read before the current instruction has finished, “loop pipelining” allows a new iteration of a loop to be started before the current iteration has finished. Although there are no explicit loops in Example 4-3, the top-level function call has an implied loop, also known as the main loop. Each iteration of the implied loop corresponds to one execution of the schedule shown in Figure 4-3. Loop pipelining allows the execution of the loop iterations to be overlapped, increasing the design performance by running them in parallel. The amount of overlap is controlled by the Initiation Interval (II), which also determines the number of pipeline stages.

NOTE: The Initiation Interval (II) is the number of clock cycles taken before starting the next loop iteration. Thus II=1 means a new loop iteration is started every clock cycle.

The initiation interval is set on a desired loop either as a design constraint in the HLS design environment or, alternatively, using a C++ compiler pragma.

NOTE: Latency refers to the time, in clock cycles, from the first input to the first output.

NOTE: Throughput, not to be confused with IO throughput, refers to how often, in clock cycles, a function call can complete.
If the design of Example 4-3 is left unconstrained there is only a single pipeline stage, because there is no overlap between executions of each iteration of the main loop. This results in data being written every four clock cycles (Figure 4-7). The design has a latency of three and a throughput of four clock cycles. Because there is no overlap of any operation, only a single adder is required if sharing reduces overall area.

Figure 4-7. No Pipelining, L=3, TP=4

If a pipelining constraint of II=3 is applied to the top-level design (main loop), then the next loop iteration can be started in C4, allowing the write of “dout” in C4 to be overlapped with the reading of the next values of “a” and “b”. The output is now written every three clock cycles while still requiring only one adder to implement the hardware (Figure 4-8). Only one pipeline stage is required since C4 is only used to allow the completion of the write on “dout”.

NOTE: The number of pipeline stages increases by one if other operations are scheduled in the last c-step.
Figure 4-8. Pipeline II=3, L=3, TP=3

Figure 4-9 shows that pipelining with an II=2 results in a new iteration being started every two clock cycles. Iteration 1 is started in C3 while iteration 0 is computing “dout = t2 + d”. Since iteration 1 is computing “t1 = a + b”, it can be seen that two adders are required for the two pipeline stages.

Figure 4-9. Pipeline II=2, L=3, TP=2

Pipelining with an II=1 (Figure 4-10) results in a new iteration being started every clock cycle. Iteration 1 is started in C2 and iteration 2 is started in C3. Looking at C3 in Figure 4-10 shows that three adders are required in hardware since all three pipeline stages are active.
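The numbers in Figures 4-7 through 4-10 follow two simple relationships, sketched here for the three-adder accumulator. The cycle-counting convention (first input read in C1, output appearing latency cycles later) is an assumption made for this sketch:

```cpp
#include <cassert>

// Cycle in which iteration i (0-based) writes its output, counting C1 as
// cycle 1 and assuming a latency of `latency` cycles from first input to
// first output.
int output_cycle(int latency, int ii, int i) {
  return latency + 1 + ii * i;
}

// Adders required = maximum number of concurrently active add stages,
// i.e. ceil(num_adds / II) for a chain of dependent adds.
int adders_needed(int num_adds, int ii) {
  return (num_adds + ii - 1) / ii;
}
```

These reproduce the figures: with II=4 (no pipelining) and II=3, one adder suffices; II=2 needs two adders; II=1 needs three, with outputs in C4, C5, C6, one per clock.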

Figure 4-10. Pipelining II = 1, L=3, TP=1

Next week: Loops
