Before You Start#

The MemryX dataflow architecture is designed to achieve high-throughput inference performance. Because it is a pipelined dataflow, unlike traditional CPU/GPU architectures, proper use of the MemryX Runtime is essential to achieving optimal performance when integrating it into user applications.

This section will cover important background concepts that will help you understand how to use the MemryX Runtime most effectively.

Hint

If you’re already familiar with async and pipelined architectures, you can skip right to the Runtime Usage section.

Async Architecture#

The MemryX Runtime is designed to be asynchronous, meaning that the input and output functions are called completely separately from each other. In other words, the input function is called as soon as a new frame is available and should not wait for the output function to finish processing the previous frame.

This async code structure is achieved by running the input and output functions in separate threads.

The difference between sync and async is further illustrated below:

Synchronous (CPU/GPU)#

Synchronous code is typically a loop that sequentially gets an input, processes it, handles the output, and then loops back to the top. This means that while the processing and output functions are running, the input function cannot gather the next frame until the next iteration of the loop.

while True:
    frame = get_next_frame()   # Input
    result = inference(frame)  # Processing
    display_result(result)     # Output
(Diagram: a single Thread-0 runs get_next_frame() → inference() → display_result() in sequence.)

On a pipelined dataflow architecture like MemryX's, this code structure leads to underutilization of the hardware and a massive drop in performance (FPS). Once an input frame is on the chip and begins processing, another can be pushed in behind it; but in this synchronous structure, the input function is blocked until the next loop iteration, so only one frame is present on the chip at a time.

Asynchronous (MemryX)#

Asynchronous send and receive threads, on the other hand, allow the input and output functions to run independently of each other. This means that while the chip and the output function are running, new inputs can continue to be sent to the chip without waiting for previous frame results.

# Input Thread
while True:
    frame = get_next_frame()   # Input
    send_input(frame)          # Send to Accelerator

#-----------------------

# Output Thread
while True:
    result = receive_output()  # Receive from Accelerator
    display_result(result)     # Output
(Diagram: Thread-0 runs get_next_frame() → send_input(), feeding the MX3 Hardware; Thread-1 runs receive_output() → display_result(), fed by the MX3 Hardware.)

With this structure, the input function can continue to gather frames and send them to the chip while the output function processes the results of previous frames. This allows for multiple frames to be present on the chip at once, filling the pipeline and leading to optimal performance.
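The two-thread structure above can be sketched with plain Python threads, using a bounded queue as a stand-in for the accelerator's on-chip pipeline. Everything here (the queues, the worker, and the doubling "inference") is a hypothetical placeholder for illustration, not a MemryX API:

```python
import threading
import queue

NUM_FRAMES = 8
PIPELINE_DEPTH = 4  # how many frames the "chip" can hold at once

chip = queue.Queue(maxsize=PIPELINE_DEPTH)  # stand-in for the accelerator input
results = queue.Queue()                     # stand-in for the accelerator output

def accelerator_worker():
    # Simulated hardware: pulls frames in, "processes" them, emits results.
    for _ in range(NUM_FRAMES):
        frame = chip.get()
        results.put(frame * 2)  # placeholder for real inference

def input_thread():
    # Input side: pushes frames as fast as the pipeline accepts them,
    # never waiting on outputs.
    for frame in range(NUM_FRAMES):
        chip.put(frame)

collected = []

def output_thread():
    # Output side: drains results independently of the input thread.
    for _ in range(NUM_FRAMES):
        collected.append(results.get())

threads = [threading.Thread(target=t)
           for t in (accelerator_worker, input_thread, output_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(collected)  # outputs arrive in order: [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the input thread only blocks when the bounded queue is full, it keeps the pipeline fed regardless of how quickly results are drained, which is exactly the behavior the real runtime threads provide.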

In some other contexts, “async” might be used differently. Below is an example of an incorrect async structure that will not result in good performance on MemryX hardware:

Incorrect Async Example

In some contexts, "async" only means that the main thread is free to do other tasks while the hardware is processing. But if that thread still waits for the output to finish before sending the next input, the structure is not truly asynchronous.

The next input is still blocked until the previous output is finished.

Warning

Do not use this code structure, as it will lead to underutilization of the hardware and poor performance.

while True:
    frame = get_next_frame()   # Input
    send_input(frame)          # Send to Accelerator

    # main thread continues with other tasks here
    do_other_tasks()

    result = receive_output()  # Wait for Output to finish
    display_result(result)     # Output
(Diagram: a single Thread-0 runs get_next_frame() → send_input() → do_other_tasks() → receive_output() → display_result() in sequence, with send_input() feeding the MX3 Hardware and receive_output() waiting on it.)

In this structure, new inputs are only sent after the output function has finished processing the previous frame, which is the same problem as the synchronous code.

Hint

While this threading can be done manually (with MxAcclMT C++), the recommended approach is to use MxAccl (C++) or AsyncAccl (Python), which handle the threading for you automatically. You just define input and output callback functions, and the runtime takes care of the rest.

Latency and Throughput#

In async architectures, when evaluating performance of a system, it is important to understand latency and throughput and how they differ.

Definitions#

  • Latency is defined as the time it takes from a start point to an end point.
      - In-to-Out Latency is the time from inputting a frame to receiving the output for that same frame.
      - Frame-to-Frame Latency is the time from one output frame to the next output frame.

  • Throughput is the number of frames processed per second, abbreviated as FPS (frames per second).

Relationship Between Latency and Throughput#

Throughput, as the number of frames coming out of the chip per second, is the inverse of frame-to-frame latency:

throughput = 1 / frame_to_frame_latency

Since frame-to-frame latency is the time it takes from one output frame to the next (in units of seconds), the inverse of this latency gives the number of frames that can be processed in one second, which is the definition of throughput (FPS).
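As a quick numeric check of this relationship:

```python
# A frame-to-frame latency of 50 ms (0.050 s) between consecutive outputs
frame_to_frame_latency = 0.050  # seconds

# Throughput is the inverse of frame-to-frame latency
throughput = 1 / frame_to_frame_latency  # frames per second

print(throughput)  # 20.0 FPS
```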

Sync Systems#

In a synchronous system, the in-to-out latency is equivalent to the frame-to-frame latency, because the input function is blocked until the output function finishes processing the previous frame.

Async Systems#

However, in an asynchronous system, multiple frames may be “in-flight” at once, meaning that the in-to-out latency and frame-to-frame latency are not tied to each other.

For example, if the in-to-out latency is 100 ms but the frame-to-frame latency is 50 ms, then while the first frame is streaming through the pipeline, the second frame can be input after ~50 ms. The pipeline thus has ~2 frames in-flight at a time. The FPS of this system is 20 FPS (1/50 ms), not 10 FPS (1/100 ms): you will receive 20 output frames per second.
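This example can be written out directly, using the 100 ms and 50 ms figures from the text:

```python
import math

in_to_out_latency = 0.100       # 100 ms: input of a frame to its output
frame_to_frame_latency = 0.050  # 50 ms: one output to the next

# Throughput depends only on the frame-to-frame latency
fps = 1 / frame_to_frame_latency  # 20.0, not 10.0

# Approximate number of frames in-flight in the pipeline
in_flight = math.ceil(in_to_out_latency / frame_to_frame_latency)

print(fps, in_flight)  # 20.0 FPS with ~2 frames in-flight
```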

This is why on the Model Explorer page, we show measured FPS and both the in-to-out latency and frame-to-frame latency.

Important

An async pipeline with multiple in-flight frames is not the same as batching. Batching is a different concept where a batch or “bundle” of frames moves through a system as a single unit at the same time.

MemryX hardware always uses a batch size of 1. Each frame is processed independently, as soon as it is put into the pipeline; frames are never collected into batches.
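To make the distinction concrete, here is an illustrative timing sketch. All numbers are hypothetical: a 30 FPS camera delivering a frame every ~33 ms, and 25 ms of processing per frame:

```python
frame_interval = 33  # ms between camera frames (hypothetical)
process_time = 25    # ms of processing per frame (hypothetical)
batch_size = 4

# Batching: frame 0 cannot start until frames 1-3 arrive to fill the batch,
# so its result is delayed by the batch-fill time.
batch_fill = (batch_size - 1) * frame_interval
frame0_latency_batched = batch_fill + process_time  # 99 + 25 = 124 ms

# Batch size 1 with pipelining: frame 0 enters the pipeline immediately.
frame0_latency_pipelined = process_time  # 25 ms

print(frame0_latency_batched, frame0_latency_pipelined)  # 124 25
```

The batch-of-4 system must hold the first frame until the batch fills, inflating its latency, while a batch-1 pipeline starts work on each frame the moment it arrives.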