Chapter 4: TLM and Platforms

TLM Performance: DMI, Quantum Tuning, and Payload Discipline

A senior-level guide to making TLM virtual platforms fast without breaking timing, ordering, or debug behavior.

How to Read This Lesson

For TLM, resist the temptation to picture pins. Picture a C++ function call carrying a transaction object, then add timing only where the architectural question needs it.

TLM performance is not one trick. It is a set of disciplined choices: transport style, payload reuse, direct memory interface, temporal decoupling, report cost, and how much timing fidelity the use case really needs.

Source and LRM Trail

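For TLM, use the IEEE 1666 TLM clauses in Docs/LRMs/SystemC_LRM_1666-2023.pdf as the portable contract. Then inspect .codex-src/systemc/src/tlm_core/tlm_2 for tlm_generic_payload, tlm_fw_transport_if, tlm_bw_transport_if, tlm_initiator_socket, tlm_target_socket, tlm_dmi, and tlm_global_quantum. The tlm_quantumkeeper utility lives in the tlm_utils namespace, so look for it alongside the other tlm_utils headers rather than under tlm_core.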

DMI and Payload Reuse Example

DMI (Direct Memory Interface) lets an initiator bypass repeated socket calls for memory-like regions. It is most valuable for RAM and ROM. Generic payloads should also be reused to avoid constant heap allocation.

Here is a full compilable example demonstrating DMI and payload reuse:

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>
 
using namespace sc_core;
 
SC_MODULE(FastMemory) {
  tlm_utils::simple_target_socket<FastMemory> socket{"socket"};
  unsigned char memory[1024];
 
  SC_CTOR(FastMemory) {
    socket.register_b_transport(this, &FastMemory::b_transport);
    socket.register_get_direct_mem_ptr(this, &FastMemory::get_direct_mem_ptr);
  }
 
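  void b_transport(tlm::tlm_generic_payload& trans, sc_time& delay) {
    // Normal slow-path transport: bounds-check, then copy to or from memory
    const sc_dt::uint64 addr = trans.get_address();
    const unsigned int len = trans.get_data_length();
    if (addr >= sizeof(memory) || len > sizeof(memory) - addr) {
      trans.set_response_status(tlm::TLM_ADDRESS_ERROR_RESPONSE);
      return;
    }
    unsigned char* data = trans.get_data_ptr();
    for (unsigned int i = 0; i < len; ++i) {
      if (trans.is_write()) memory[addr + i] = data[i];
      else if (trans.is_read()) data[i] = memory[addr + i];
    }
    trans.set_response_status(tlm::TLM_OK_RESPONSE);
    trans.set_dmi_allowed(true); // Hint to initiator that DMI is available
  }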
 
  bool get_direct_mem_ptr(tlm::tlm_generic_payload& trans, tlm::tlm_dmi& dmi_data) {
    // Grant DMI access to the entire 1KB memory
    dmi_data.allow_read_write();
    dmi_data.set_dmi_ptr(memory);
    dmi_data.set_start_address(0);
    dmi_data.set_end_address(1023);
    dmi_data.set_read_latency(SC_ZERO_TIME);
    dmi_data.set_write_latency(SC_ZERO_TIME);
    return true;
  }
};
 
SC_MODULE(OptimizedInitiator) {
  tlm_utils::simple_initiator_socket<OptimizedInitiator> socket{"socket"};
  tlm::tlm_generic_payload reused_payload; // Payload reuse
  
  unsigned char* dmi_ptr = nullptr;
  uint64_t dmi_start = 0, dmi_end = 0;
  bool dmi_valid = false;
 
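  SC_CTOR(OptimizedInitiator) {
    socket.register_invalidate_direct_mem_ptr(
        this, &OptimizedInitiator::invalidate_direct_mem_ptr);
    SC_THREAD(run);
  }
 
  // Drop the cached pointer whenever the target revokes an overlapping range
  void invalidate_direct_mem_ptr(sc_dt::uint64 start, sc_dt::uint64 end) {
    if (start <= dmi_end && end >= dmi_start) dmi_valid = false;
  }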
 
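  void run() {
    const sc_dt::uint64 addr = 0x10;
    unsigned char data = 0xAA;
 
    // Two writes: the first takes the slow path and acquires DMI,
    // the second hits the cached DMI pointer.
    for (int i = 0; i < 2; ++i) {
      // Fast path: use the cached DMI pointer when it covers the address
      if (dmi_valid && addr >= dmi_start && addr <= dmi_end) {
        dmi_ptr[addr - dmi_start] = data;
        continue;
      }
 
      // Slow path: reset every initiator-owned field on the reused payload
      reused_payload.set_command(tlm::TLM_WRITE_COMMAND);
      reused_payload.set_address(addr);
      reused_payload.set_data_ptr(&data);
      reused_payload.set_data_length(1);
      reused_payload.set_streaming_width(1);
      reused_payload.set_byte_enable_ptr(nullptr);
      reused_payload.set_dmi_allowed(false);
      reused_payload.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
 
      sc_time delay = SC_ZERO_TIME;
      socket->b_transport(reused_payload, delay);
      if (reused_payload.is_response_error())
        SC_REPORT_ERROR(name(), reused_payload.get_response_string().c_str());
 
      // Request DMI only if the target hinted that it is available
      if (reused_payload.is_dmi_allowed()) {
        tlm::tlm_dmi dmi_data;
        if (socket->get_direct_mem_ptr(reused_payload, dmi_data) &&
            dmi_data.is_write_allowed()) {
          dmi_valid = true;
          dmi_ptr = dmi_data.get_dmi_ptr();
          dmi_start = dmi_data.get_start_address();
          dmi_end = dmi_data.get_end_address();
          SC_REPORT_INFO(name(), "DMI successfully established!");
        }
      }
    }
  }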
};
 
int sc_main(int argc, char* argv[]) {
  OptimizedInitiator init("init");
  FastMemory mem("mem");
  init.socket.bind(mem.socket);
  sc_start();
  return 0;
}

DMI Safety

Targets should grant DMI only when direct access is safe (e.g. memory is contiguous, invalidation is implemented). Routers must translate DMI ranges. If a target grants a local address window, the router returns the corresponding system address window.
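
The range translation a router performs can be sketched in plain C++ without SystemC. The DmiWindow and PortMapping structs and the to_system_window function below are hypothetical names used only for illustration; a real router would apply the same arithmetic to the tlm_dmi object it forwards from get_direct_mem_ptr, and the same offset applies to invalidation ranges travelling backward from the target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the address range carried by tlm::tlm_dmi.
struct DmiWindow {
    std::uint64_t start;  // first byte covered (inclusive)
    std::uint64_t end;    // last byte covered (inclusive)
};

// Hypothetical router port mapping: the target occupies
// [base, base + size - 1] in the system address map.
struct PortMapping {
    std::uint64_t base;
    std::uint64_t size;
};

// Convert a target-local DMI window to system addresses, clipping it
// to the region the router actually maps to this target.
inline DmiWindow to_system_window(const DmiWindow& local, const PortMapping& map) {
    assert(map.size > 0 && local.start <= local.end && local.start < map.size);
    const std::uint64_t last_local = map.size - 1;
    const std::uint64_t clipped_end = local.end < last_local ? local.end : last_local;
    return DmiWindow{map.base + local.start, map.base + clipped_end};
}
```

Clipping matters when a target reports a window larger than the slice the router exposes: without it, an initiator could use DMI to reach addresses that would decode to a different target on the normal transport path.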

Extensions and Report Cost

Project policy should define who owns extensions and when they are cleared. Avoid format strings in hot paths unless tracing is enabled. Do not format complex messages that will never be displayed.
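
One way to keep report cost out of the hot path is to defer formatting until the message is known to be wanted. The sketch below is plain C++ and does not use the SystemC report handler; TraceSink, trace, and describe_write are hypothetical names, and the enabled flag stands in for whatever verbosity check the project uses.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical trace sink; 'enabled' stands in for a verbosity check.
struct TraceSink {
    bool enabled = false;
    std::vector<std::string> lines;
};

// The message is built by a callable, so the formatting cost is paid
// only when the sink is enabled.
template <typename MakeMessage>
void trace(TraceSink& sink, MakeMessage&& make_message) {
    if (!sink.enabled) return;  // hot path: one branch, no formatting
    sink.lines.push_back(make_message());
}

// Example of a relatively expensive message builder. The counter lets
// callers observe how often formatting actually happens.
inline std::string describe_write(std::uint64_t addr, unsigned value, int& format_calls) {
    ++format_calls;
    std::ostringstream os;
    os << "write addr=0x" << std::hex << addr << " data=0x" << value;
    return os.str();
}
```

A call such as trace(sink, [&] { return describe_write(addr, value, calls); }) costs a single branch while tracing is off, which is the property the policy above asks for.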

Expert Checklist

A performant TLM VP should:

  • use blocking transport for simple memory-mapped software access
  • use DMI for RAM/ROM
  • use temporal decoupling only with clear synchronization policy
  • reuse payloads in hot initiators
  • avoid dynamic allocation in per-transaction paths
  • set response status every time

Exhaustive Deep Dive: IEEE 1666-2023 LRM and Accellera TLM Source Implementation

Performance in TLM-2.0 is not magic; it is a meticulously architected set of bypass mechanisms and synchronization policies codified in the IEEE 1666-2023 LRM. The two most critical optimizations are the Direct Memory Interface (DMI) and Temporal Decoupling. Let's trace both the LRM mandates and their exact source implementations.

DMI: LRM Section 11 and tlm_dmi

The Direct Memory Interface (LRM Section 11) exists because function calls through sockets for every byte of memory access are too slow for instruction-set simulators (ISS) booting Linux.

LRM Clause 11.2 defines the tlm_dmi class. This is not a transaction payload; it is a metadata container granting a direct pointer to a contiguous block of memory. When you inspect src/tlm_core/tlm_2/tlm_2_interfaces/tlm_dmi.h in the Accellera source, the class boils down to:

class tlm_dmi {
    unsigned char*   m_dmi_ptr;
    sc_dt::uint64    m_dmi_start_address;
    sc_dt::uint64    m_dmi_end_address;
    dmi_access_e     m_dmi_access; // DMI_ACCESS_NONE, _READ, _WRITE, _READ_WRITE
    sc_core::sc_time m_dmi_read_latency;
    sc_core::sc_time m_dmi_write_latency;
    // ...
};

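When an initiator calls get_direct_mem_ptr, it is asking the target to populate this object. If the call returns true, the initiator reads m_dmi_ptr through get_dmi_ptr() and the granted range through get_start_address() and get_end_address(). From that moment on, accesses inside the granted range need no transport call, although the initiator remains responsible for honoring the granted access type and latencies: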

// Ultimate fast-path: Pure C++ array access
dmi_ptr[address - dmi_start_address] = data;

This bypasses b_transport, the scheduler, payload creation, and sockets.

Invalidation (LRM 11.2.5): If a target needs to remap memory or change access rights, it must call invalidate_direct_mem_ptr with the affected address range on its socket. The call propagates along the backward path to all initiators, forcing them to discard any cached tlm_dmi pointers that overlap the range and fall back to b_transport until DMI is granted again.
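
The initiator-side bookkeeping for invalidation can be sketched in plain C++. DmiCache and its members are hypothetical names, not part of the TLM library; the invalidate method only mirrors the (start, end) shape of invalidate_direct_mem_ptr and assumes inclusive bounds on both ranges.

```cpp
#include <cstdint>

// Hypothetical initiator-side record of one granted DMI region.
struct DmiCache {
    unsigned char* ptr = nullptr;
    std::uint64_t start = 0;  // inclusive
    std::uint64_t end = 0;    // inclusive
    bool valid = false;

    // True when the cached region can serve this address.
    bool covers(std::uint64_t addr) const {
        return valid && addr >= start && addr <= end;
    }

    // Drop the cached pointer if the invalidated range overlaps it.
    void invalidate(std::uint64_t inv_start, std::uint64_t inv_end) {
        if (valid && inv_start <= end && inv_end >= start) {
            valid = false;
            ptr = nullptr;
        }
    }
};
```

An overlap test is used rather than an exact match because a target may invalidate a range that is larger or smaller than the window a particular initiator cached.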

Temporal Decoupling and the Quantum Keeper (LRM Section 12)

If an initiator uses DMI to execute 1,000 instructions without calling wait(), no simulation time advances and no other process gets to run. If it calls wait() after every instruction, the SystemC scheduler becomes the bottleneck. Temporal Decoupling (LRM 12.1) solves this.

Instead of yielding to the kernel, the initiator accumulates a local "time offset". It runs ahead of the official simulation time sc_time_stamp(). The LRM introduces the Global Quantum (LRM 12.2.3), which represents the maximum time an initiator is allowed to run ahead before it must yield to allow other threads to catch up.

In src/tlm_core/tlm_2/tlm_quantum/tlm_global_quantum.h, the global quantum is implemented as a singleton:

class tlm_global_quantum {
public:
    static tlm_global_quantum& instance();
    void set( const sc_core::sc_time& t );
    const sc_core::sc_time& get() const;
    // ...
};

To manage local offsets safely, the LRM recommends the Quantum Keeper (LRM 12.2.4), implemented in src/tlm_utils/tlm_quantumkeeper.h. An initiator uses it like this:

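#include <tlm_utils/tlm_quantumkeeper.h>
 
tlm_utils::tlm_quantumkeeper m_qk;
 
void run() {
    m_qk.reset(); // Compute the first sync point from the global quantum
    while (true) {
        // Execute one instruction and charge its cost to local time
        m_qk.inc(sc_time(10, SC_NS));
 
        // If local time has reached the quantum boundary, yield
        if (m_qk.need_sync()) {
            m_qk.sync(); // wait(local time), then reset the local offset
        }
    }
}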

Inside tlm_quantumkeeper::sync(), the keeper calls SystemC's wait(m_local_time) and then resets the local time offset to zero. This dramatically reduces the number of context switches. Instead of switching threads 10,000 times for 10,000 clock cycles, the threads switch only once per quantum (e.g., every 10,000 cycles).
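
The effect of the quantum on context-switch count can be checked with a small integer-time model. This is a plain C++ sketch, not tlm_quantumkeeper: QuantumModel and count_syncs are hypothetical names, times are nanoseconds held in integers, and the model assumes simulation starts at time zero so that quantum boundaries and the local offset line up.

```cpp
#include <cstdint>

// Hypothetical integer-time model of the inc / need_sync / sync pattern.
struct QuantumModel {
    std::uint64_t quantum_ns;      // global quantum
    std::uint64_t now_ns = 0;      // stands in for sc_time_stamp()
    std::uint64_t local_ns = 0;    // local offset ahead of now_ns
    std::uint64_t sync_count = 0;  // number of modeled context switches

    explicit QuantumModel(std::uint64_t q) : quantum_ns(q) {}

    void inc(std::uint64_t dt) { local_ns += dt; }
    bool need_sync() const { return local_ns >= quantum_ns; }
    void sync() {  // models wait(local time) followed by a reset
        now_ns += local_ns;
        local_ns = 0;
        ++sync_count;
    }
};

// Run 'instructions' steps of 'step_ns' each and report how many syncs occurred.
inline std::uint64_t count_syncs(std::uint64_t instructions,
                                 std::uint64_t step_ns,
                                 std::uint64_t quantum_ns) {
    QuantumModel qk(quantum_ns);
    for (std::uint64_t i = 0; i < instructions; ++i) {
        qk.inc(step_ns);
        if (qk.need_sync()) qk.sync();
    }
    return qk.sync_count;
}
```

With 10,000 instructions of 10 ns each, a 10 ns quantum yields 10,000 syncs, a 1 us quantum yields 100, and a 100 us quantum yields 1. The quantum is therefore a direct trade between scheduler overhead and how stale other processes' view of time may become.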

Payload Allocation Discipline (LRM Section 14.2.4)

Even without DMI, per-transaction memory allocation is costly. Profiling a naive TLM model often shows new tlm_generic_payload near the top of the profile, and in allocation-heavy models it can dominate execution time. The standard provides tlm_mm_interface so that payloads can be pooled, but the golden rule for blocking transport (b_transport) is simple: allocate once, reuse indefinitely.

An initiator should instantiate a single tlm_generic_payload as a class member, configure the fields that change (command, address, data pointer and length), reset the response status and DMI hint, and send it. For non-blocking transport, where several payloads are in flight concurrently, you must implement a Memory Manager (tlm_mm_interface). The generic payload source under src/tlm_core/tlm_2/tlm_generic_payload shows that acquire() and release() simply increment and decrement an internal m_ref_count. When the count reaches zero, the payload calls m_mm->free(this) instead of delete, allowing the memory manager to return the payload to a free list or pool for reuse.
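
The reference-counting handshake between a payload and its memory manager can be sketched in plain C++. Txn and Pool are hypothetical stand-ins for tlm_generic_payload and a tlm_mm_interface implementation; only the acquire, release, and free call pattern follows the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pool;  // forward declaration

// Hypothetical stand-in for a reference-counted payload.
struct Txn {
    Pool* mm = nullptr;
    int ref_count = 0;
    void acquire() { assert(mm != nullptr); ++ref_count; }
    void release();  // returns the object to its pool when the count hits zero
};

// Hypothetical pool playing the role of a memory manager.
struct Pool {
    std::vector<Txn*> free_list;
    std::size_t heap_allocations = 0;

    Txn* allocate() {
        if (free_list.empty()) {
            ++heap_allocations;  // only a pool miss touches the heap
            Txn* t = new Txn;
            t->mm = this;
            return t;
        }
        Txn* t = free_list.back();
        free_list.pop_back();
        return t;
    }
    void free(Txn* t) { free_list.push_back(t); }  // recycle, do not delete
    ~Pool() { for (Txn* t : free_list) delete t; }
};

inline void Txn::release() {
    assert(mm != nullptr && ref_count > 0);
    if (--ref_count == 0) mm->free(this);
}
```

After warm-up the pool serves every request from its free list, so the steady-state transaction path performs no heap allocation. A production memory manager would also reset the payload's fields and extensions before handing it out again.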

Mastering DMI, the Quantum Keeper, and Payload memory management forms the triad of TLM virtual platform performance.
