Loop Unrolling, Pipelining, and Hardware Scheduling
A deep dive into loop unrolling, pipelining pragmas, and hardware scheduling in SystemC High-Level Synthesis (HLS).
How to Read This Lesson
For synthesis, the question changes from 'can C++ run this?' to 'can hardware be built from this?' Keep storage, timing, and static structure in your head as you read.
Loop Unrolling and Pipelining in HLS
In standard C++ software, loops execute sequentially on a CPU. You don't have to worry about how long they take in terms of "clock cycles," only their general algorithmic complexity (for example, O(N)).
In High-Level Synthesis (HLS), however, C++ loops are physically transformed into silicon. The way you write your loop—and specifically where you place wait() statements—dictates whether the HLS tool generates massive parallel combinational logic, sequential state machines, or optimized hardware pipelines.
It is crucial to understand the difference between how the Accellera SystemC Simulation Kernel treats a loop and how an HLS Compiler treats it.
Source and LRM Trail
For synthesis, use Docs/LRMs/SystemC_Synthesis_Subset_1_4_7.pdf as the primary contract and Docs/LRMs/SystemC_LRM_1666-2023.pdf for base SystemC semantics. Source internals explain simulation behavior, but synthesizability is a tool contract: focus on static structure, reset modeling, wait placement, and bounded loops.
The Kernel Reality vs. The HLS Compiler
When you compile a SystemC model with GCC or Clang and link against the Accellera kernel, your loop is just a standard C++ loop that executes sequentially on your host machine's CPU. If there is no wait(), the loop runs to completion within a single process evaluation and consumes no simulated time; because the scheduler (sc_simcontext::crunch()) is cooperative, no other process runs until the loop finishes. If there is a wait(), the thread process suspends through a coroutine context switch (QuickThreads or pthreads, depending on the build), which preserves its stack, and yields control back to the scheduler until it is resumed on the next clock edge.
An HLS compiler (like Siemens Catapult or Cadence Stratus) behaves very differently. It parses the Abstract Syntax Tree (AST) of your C++ code. It uses the wait() statements as explicit register boundaries to slice your C++ code into a Finite State Machine (FSM).
1. Loops Without wait(): Unrolling and Combinational Logic
If you write a for or while loop that does not contain a wait() statement, you are instructing the HLS compiler that all iterations of this loop must execute in the same clock cycle.
// Executed entirely within one clock cycle
int sum = 0;
for(int i = 0; i < 4; i++) {
    sum += data[i];
}
result.write(sum);
wait(); // Clock edge occurs HERE
To achieve this physically, the HLS tool must unroll the loop. It flattens the AST, creating four separate adders in hardware and chaining them together as pure combinational logic.
- The Catch: If your loop iterates 10,000 times, the tool will try to generate 10,000 adders in one long combinational chain. The delay through that chain will far exceed any realistic clock period, so the design will fail timing. Therefore, loops without wait() must have a small, statically determinable number of iterations.
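The flattening described above can be sketched in plain C++, without any SystemC dependency. The function names sum_rolled and sum_unrolled are hypothetical, and the unrolled form is only an approximation of what a tool might build internally; a real tool would likely fold away the addition with the constant 0.

```cpp
#include <array>
#include <cassert>

// Rolled form: what the designer writes (no wait() in the loop body).
int sum_rolled(const std::array<int, 4>& data) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += data[i];
    }
    return sum;
}

// Unrolled form: one expression per iteration, chained together.
// Each line corresponds to one adder in the combinational chain.
int sum_unrolled(const std::array<int, 4>& data) {
    const int s0 = 0 + data[0];   // iteration i = 0
    const int s1 = s0 + data[1];  // iteration i = 1
    const int s2 = s1 + data[2];  // iteration i = 2
    const int s3 = s2 + data[3];  // iteration i = 3
    return s3;
}
```

Both functions return the same value for any input; the difference only matters to a synthesis tool, which sees the second form as a fixed chain of adders whose total delay must fit inside one clock period.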
2. Loops With wait(): Sequential Execution
If you place a wait() inside the loop, the HLS tool slices the AST at that boundary, generating an FSM state transition.
int sum = 0;
for(int i = 0; i < 4; i++) {
    sum += data[i];
    wait(); // Clock edge occurs on EVERY iteration
}
result.write(sum);
In this case, the HLS tool only needs to generate one physical adder. On clock cycle 1 (State 1), it adds data[0]. On cycle 2 (State 2), it adds data[1]. The loop will take exactly 4 clock cycles to complete, saving massive silicon area at the cost of latency.
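To make the "one adder, four cycles" behavior concrete, here is a minimal plain C++ model of the sequential version. The SequentialSum struct and run_sequential function are hypothetical names for illustration; this is a software sketch of the register and FSM behavior, not the output of any particular HLS tool.

```cpp
#include <array>
#include <cassert>
#include <utility>

// Hypothetical model of the hardware: a loop counter register, an
// accumulator register, and a single adder shared across cycles.
struct SequentialSum {
    int i = 0;          // loop counter register
    int sum = 0;        // accumulator register
    bool done = false;  // set once all four elements are consumed

    // One call models one rising clock edge: exactly one addition.
    void clock_edge(const std::array<int, 4>& data) {
        if (done) return;
        sum += data[i];
        ++i;
        if (i == 4) done = true;
    }
};

// Clocks the model until it finishes; returns {sum, cycles used}.
std::pair<int, int> run_sequential(const std::array<int, 4>& data) {
    SequentialSum fsm;
    int cycles = 0;
    while (!fsm.done) {
        fsm.clock_edge(data);
        ++cycles;
    }
    return {fsm.sum, cycles};
}
```

Calling run_sequential on {1, 2, 3, 4} yields a sum of 10 after 4 modeled clock edges, matching the latency described above.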
Tool-Specific Pragmas: Unrolling and Pipelining
Because SystemC is standard C++, it doesn't have native language keywords for hardware micro-architecture. EDA vendors provide compiler directives (#pragma) to control exactly how the AST is transformed.
- #pragma HLS UNROLL: Tells the compiler to explicitly replicate the hardware logic for the loop body. You can specify a factor (e.g., factor=2) to partially unroll a loop, balancing area and speed.
- #pragma HLS PIPELINE: Rather than waiting for the entire loop iteration to finish, pipelining inserts registers between stages of the datapath so that the next iteration of the loop can start while the current iteration is still executing. The time between starting consecutive iterations is known as the Initiation Interval (II).
The spellings shown here follow the AMD/Xilinx Vitis HLS convention. Other tools, including Catapult and Stratus, use their own directive syntax, so check the vendor documentation for the exact form.
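The effect of these directives on latency can be estimated with simple arithmetic. The sketch below uses the standard idealized formulas and assumes no stalls, no memory port conflicts, and an unroll factor that divides the trip count evenly; all function names are hypothetical.

```cpp
#include <cassert>  // for the usage checks shown below the block

// Cycles for a loop that is not pipelined: iterations run back to back.
constexpr int cycles_sequential(int trip_count, int iter_latency) {
    return trip_count * iter_latency;
}

// Idealized cycles for a pipelined loop: the first iteration takes
// iter_latency cycles, and each later one starts ii cycles after the
// previous one.
constexpr int cycles_pipelined(int trip_count, int iter_latency, int ii) {
    return trip_count == 0 ? 0 : iter_latency + (trip_count - 1) * ii;
}

// Iterations remaining after partial unrolling by `factor`
// (assumes factor divides trip_count evenly).
constexpr int trips_after_unroll(int trip_count, int factor) {
    return trip_count / factor;
}

// 100 iterations, 3 cycles each.
static_assert(cycles_sequential(100, 3) == 300, "no pipelining");
static_assert(cycles_pipelined(100, 3, 1) == 102, "pipelined, II = 1");
static_assert(trips_after_unroll(100, 2) == 50, "unroll factor 2");
```

With II = 1 the loop approaches one result per clock cycle after the pipeline fills, which is why the initiation interval, rather than the per-iteration latency, usually dominates throughput.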
Synthesis Subset LRM Restrictions
When dealing with loops, the SystemC Synthesis Subset 1.4.7 mandates:
- Static Bounds for Unrolling: If a loop contains no wait() (meaning it must be completely unrolled into combinational logic), the number of iterations must be statically determinable at compile time. You cannot use a dynamically changing port value as the termination condition for a loop without a wait().
- wait() in Helper Functions: Generally, if a helper function contains a wait(), it must be inlined into the calling thread, where it adds states to that thread's FSM schedule.
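A common way to satisfy the static-bound rule when the useful length is only known at run time is to loop to a fixed maximum and let the runtime value gate the loop body. The sketch below is plain C++; MAX_LEN and bounded_sum are hypothetical names, and whether a given tool accepts this pattern should be confirmed in its documentation.

```cpp
#include <array>
#include <cassert>

constexpr int MAX_LEN = 8;  // hypothetical compile-time upper bound

// The trip count is always MAX_LEN, so the loop can be fully unrolled.
// The runtime value `len` never appears in the loop condition; it only
// decides whether each iteration contributes to the sum.
int bounded_sum(const std::array<int, MAX_LEN>& data, int len) {
    int sum = 0;
    for (int i = 0; i < MAX_LEN; i++) {
        if (i < len) {
            sum += data[i];
        }
    }
    return sum;
}
```

In hardware terms, the comparison i < len becomes a select on each adder input rather than a variable loop bound, at the cost of always building logic for MAX_LEN iterations.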
End-to-End Example: A Dot Product Unit
Below is a complete, compilable SystemC model of a Dot Product unit. The loop inside compute_thread lacks a wait(), making it a prime candidate for Loop Unrolling by an HLS compiler.
#include <systemc.h>
// -------------------------------------------------------------------------
// Synthesizable Hardware Module
// -------------------------------------------------------------------------
SC_MODULE(DotProductUnit) {
    sc_in<bool> clk;
    sc_in<bool> rst_n;
    // Arrays of ports for input vectors
    sc_in<int> a[4];
    sc_in<int> b[4];
    sc_in<bool> start;
    sc_out<int> result;
    sc_out<bool> valid;
    void compute_thread() {
        // --- RESET BLOCK ---
        result.write(0);
        valid.write(false);
        wait();
        // --- FUNCTIONAL BLOCK ---
        while (true) {
            if (start.read()) {
                int sum = 0;
                // --- LOOP UNROLLING CANDIDATE ---
                // Because there is no wait() inside this loop, the HLS compiler
                // will fully unroll this, generating 4 parallel multipliers
                // and an adder tree that executes in a single clock cycle.
                //
                // Example vendor directive:
                // #pragma HLS UNROLL
                for (int i = 0; i < 4; i++) {
                    sum += a[i].read() * b[i].read();
                }
                result.write(sum);
                valid.write(true);
            } else {
                valid.write(false);
            }
            wait(); // End of the clock cycle state
        }
    }
    SC_CTOR(DotProductUnit) {
        SC_CTHREAD(compute_thread, clk.pos());
        async_reset_signal_is(rst_n, false);
    }
};
// -------------------------------------------------------------------------
// Testbench / Simulation
// -------------------------------------------------------------------------
int sc_main(int argc, char* argv[]) {
    sc_clock clk("clk", 10, SC_NS);
    sc_signal<bool> rst_n;
    sc_signal<bool> start;
    sc_signal<int> a[4];
    sc_signal<int> b[4];
    sc_signal<int> result;
    sc_signal<bool> valid;
    // Instantiate and bind
    DotProductUnit dut("dut");
    dut.clk(clk);
    dut.rst_n(rst_n);
    dut.start(start);
    for(int i = 0; i < 4; ++i) {
        dut.a[i](a[i]);
        dut.b[i](b[i]);
    }
    dut.result(result);
    dut.valid(valid);
    // Initialization
    rst_n.write(false); // Assert reset
    start.write(false);
    for(int i = 0; i < 4; ++i) {
        a[i].write(0);
        b[i].write(0);
    }
    sc_start(15, SC_NS);
    rst_n.write(true); // Release reset
    // Test Case: Provide vector data
    std::cout << "@" << sc_time_stamp() << " Feeding inputs..." << std::endl;
    for(int i = 0; i < 4; ++i) {
        a[i].write(i + 1); // Vector A: [1, 2, 3, 4]
        b[i].write(2);     // Vector B: [2, 2, 2, 2]
    }
    start.write(true);
    // Run through the next rising edge (20 ns with the default sc_clock
    // phase): the thread samples start and the inputs, writes result,
    // and asserts valid
    sc_start(10, SC_NS);
    start.write(false);
    // Run through one more rising edge (30 ns): start is now low, so
    // valid is deasserted while result holds its last value
    sc_start(10, SC_NS);
    // Expected: (1*2) + (2*2) + (3*2) + (4*2) = 2 + 4 + 6 + 8 = 20
    std::cout << "@" << sc_time_stamp() << " Result: " << result.read()
              << " (Expected 20)" << std::endl;
    // valid was a one-cycle pulse, so it reads false at this point
    std::cout << "Valid: " << (valid.read() ? "true" : "false") << std::endl;
    return 0;
}
By carefully managing loops and wait() statements according to the Synthesis Subset LRM, you control whether your SystemC algorithm is synthesized into parallel combinational hardware or a sequential FSM, even though the Accellera kernel executes both forms as ordinary sequential C++.
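The structure that the comments in compute_thread describe, four multipliers feeding an adder tree, can be written out by hand in plain C++ to see what full unrolling produces. The names Vec4, dot4_loop, and dot4_unrolled are hypothetical, and rebalancing the additions into a tree rather than a chain is an optimization a tool may or may not apply.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>;

// Rolled form, as written in the module's thread.
int dot4_loop(const Vec4& a, const Vec4& b) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Hand-unrolled form: four independent products, then a two-level
// adder tree instead of a four-deep chain.
int dot4_unrolled(const Vec4& a, const Vec4& b) {
    const int p0 = a[0] * b[0];
    const int p1 = a[1] * b[1];
    const int p2 = a[2] * b[2];
    const int p3 = a[3] * b[3];
    const int s01 = p0 + p1;  // tree level 1
    const int s23 = p2 + p3;  // tree level 1
    return s01 + s23;         // tree level 2
}
```

For the testbench vectors [1, 2, 3, 4] and [2, 2, 2, 2], both functions return 20, the value the simulation is expected to print.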
Deep Dive: Accellera Source for sc_signal and update()
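The sc_signal<T> channel illustrates the Evaluate-Update paradigm of SystemC. In the Accellera source, sc_signal is a class template under src/sysc/communication/ (the template code lives in sc_signal.h, with supporting non-template code in sc_signal.cpp), and it ultimately derives from sc_prim_channel.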
The write() Implementation
When you call write(const T&), the signal does not immediately change its value. Instead, it stores the requested value in m_new_val and registers itself with the kernel. In simplified form (the real code also carries a writer-policy template parameter and its checks):
template<class T>
inline void sc_signal<T>::write(const T& value_) {
    if( !(m_new_val == value_) ) {
        m_new_val = value_;
        this->request_update(); // Inherited from sc_prim_channel
    }
}
The request_update() call adds the channel to the kernel's update list, which is maintained by the primitive channel registry owned by the simulation context.
The update() Phase
After the Evaluate phase finishes (all ready processes have run), the kernel iterates over the update list and calls the update() virtual function on each primitive channel. For sc_signal, a simplified version looks like:
template<class T>
inline void sc_signal<T>::update() {
    if( !(m_new_val == m_cur_val) ) {
        m_cur_val = m_new_val;
        m_value_changed_event.notify(SC_ZERO_TIME); // Notify processes sensitive to value_changed_event()
    }
}
This guarantees that all processes running in the same delta cycle see the same old value until the update phase completes, which mimics the way hardware registers hold their outputs until the next clock edge.
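The two-phase behavior can be reproduced in a few lines of plain C++ without the SystemC library. ToyKernel, ToySignal, and swap_in_one_delta are hypothetical names; this is a deliberately reduced model of the idea, not the Accellera implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct ToySignal;

// Hypothetical stand-in for the kernel's update list.
struct ToyKernel {
    std::vector<ToySignal*> update_list;
    void update_phase();
};

// Hypothetical signal with separate current and pending values.
struct ToySignal {
    ToyKernel& kernel;
    int cur;
    int next;
    bool pending = false;

    ToySignal(ToyKernel& k, int init) : kernel(k), cur(init), next(init) {}

    int read() const { return cur; }  // always the pre-update value

    void write(int v) {
        next = v;
        if (!pending) {  // register with the kernel only once per delta
            pending = true;
            kernel.update_list.push_back(this);
        }
    }

    void update() {
        cur = next;
        pending = false;
    }
};

inline void ToyKernel::update_phase() {
    for (ToySignal* s : update_list) s->update();
    update_list.clear();
}

// Two "processes" exchange a and b in the same evaluate phase.
std::pair<int, int> swap_in_one_delta(int a0, int b0) {
    ToyKernel k;
    ToySignal a(k, a0);
    ToySignal b(k, b0);
    a.write(b.read());  // process 1
    b.write(a.read());  // process 2 still sees the old value of a
    k.update_phase();   // both new values become visible together
    return {a.read(), b.read()};
}
```

Starting from a = 1 and b = 2, the function returns {2, 1}: the values swap cleanly because neither write is visible until the update phase, just as two cross-coupled registers would swap on a clock edge.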