Before You Start#

The MemryX dataflow architecture is designed to achieve high-throughput inference performance. Because it is a pipelined dataflow, unlike traditional CPU/GPU architectures, proper use of the MemryX Runtime is essential to achieving optimal performance when integrating it into user applications.

This section will cover important background concepts that will help you understand how to use the MemryX Runtime most effectively.

Hint

If you’re already familiar with async and pipelined architectures, you can skip right to the Runtime Usage section.

Async Architecture#

The MemryX Runtime is designed to be asynchronous, meaning that the input and output functions are called completely separately from each other. In other words, the input function is called as soon as a new frame is available and should not wait for the output function to finish processing the previous frame.

This async code structure is achieved by running the input and output functions in separate threads.

The difference between sync and async is further illustrated below:

Synchronous (CPU/GPU)#

Synchronous code is typically a loop that sequentially gets an input, processes it, handles the output, and then loops back to the top. This means that while the processing and output functions are running, the input function cannot gather the next frame until the next iteration of the loop.

while True:
    frame = get_next_frame()   # Input
    result = inference(frame)  # Processing
    display_result(result)     # Output
(Diagram: a single Thread-0 runs get_next_frame() → inference() → display_result() in sequence.)

On a pipelined dataflow architecture like MemryX's, this code structure leads to underutilization of the hardware and a massive drop in performance (FPS). Once an input frame is on the chip and begins processing, another can be pushed in behind it; but in this synchronous structure, the input function is blocked until the next loop iteration, so only one frame is present on the chip at a time.

Asynchronous (MemryX)#

Asynchronous send and receive threads, on the other hand, allow the input and output functions to run independently of each other. This means that while the chip and the output function are running, new inputs can continue to be sent to the chip without waiting for previous frame results.

# Input Thread
while True:
    frame = get_next_frame()   # Input
    send_input(frame)          # Send to Accelerator

#-----------------------

# Output Thread
while True:
    result = receive_output()  # Receive from Accelerator
    display_result(result)     # Output
(Diagram: Thread-0 runs get_next_frame() → send_input(), feeding the MX3 Hardware; Thread-1 runs receive_output() → display_result(), fed by the MX3 Hardware.)

With this structure, the input function can continue to gather frames and send them to the chip while the output function processes the results of previous frames. This allows for multiple frames to be present on the chip at once, filling the pipeline and leading to optimal performance.
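The two-thread structure above can be sketched with plain Python threads, using a bounded queue as a stand-in for the accelerator's on-chip pipeline. Everything here (the queues, the worker, and the doubling "inference") is a hypothetical placeholder for illustration, not a MemryX API:

```python
import threading
import queue

NUM_FRAMES = 8
PIPELINE_DEPTH = 4  # how many frames the "chip" can hold at once

chip = queue.Queue(maxsize=PIPELINE_DEPTH)  # stand-in for the accelerator input
results = queue.Queue()                     # stand-in for the accelerator output

def accelerator_worker():
    # Simulated hardware: pulls frames in, "processes" them, emits results.
    for _ in range(NUM_FRAMES):
        frame = chip.get()
        results.put(frame * 2)  # placeholder for real inference

def input_thread():
    # Input side: pushes frames as fast as the pipeline accepts them,
    # never waiting on outputs.
    for frame in range(NUM_FRAMES):
        chip.put(frame)

collected = []

def output_thread():
    # Output side: drains results independently of the input thread.
    for _ in range(NUM_FRAMES):
        collected.append(results.get())

threads = [threading.Thread(target=t)
           for t in (accelerator_worker, input_thread, output_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(collected)  # outputs arrive in order: [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the input thread only blocks when the bounded queue is full, it keeps the pipeline fed regardless of how quickly results are drained, which is exactly the behavior the real runtime threads provide.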

In some other contexts, “async” might be used differently. Below is an example of an incorrect async structure that will not result in good performance on MemryX hardware:

Incorrect Async Example

In some contexts, "async" only means that the main thread is free to do other tasks while the hardware is processing. But if that thread still waits for the output to finish before sending the next input, the structure is not truly asynchronous.

The next input is still blocked until the previous output is finished.

Warning

Do not use this code structure, as it will lead to underutilization of the hardware and poor performance.

while True:
    frame = get_next_frame()   # Input
    send_input(frame)          # Send to Accelerator

    # main thread continues with other tasks here
    do_other_tasks()

    result = receive_output()  # Wait for Output to finish
    display_result(result)     # Output
(Diagram: a single Thread-0 runs get_next_frame() → send_input() → do_other_tasks() → receive_output() → display_result() in sequence, with send_input() feeding the MX3 Hardware and receive_output() waiting on it.)

In this structure, new inputs are only sent after the output function has finished processing the previous frame, which is the same problem as the synchronous code.

Hint

While this threading can be done manually (with MxAcclMT C++), the recommended approach is to use MxAccl (C++) or AsyncAccl (Python), which handle the threading for you automatically. You just define input and output callback functions, and the runtime takes care of the rest.

Latency and Throughput#

In async architectures, when evaluating performance of a system, it is important to understand latency and throughput and how they differ.

Definitions#

  • Latency is defined as the time it takes from a start point to an end point.
      - In-to-Out Latency is the time from inputting a frame to receiving the output for that same frame.
      - Frame-to-Frame Latency is the time from one output frame to the next output frame.

  • Throughput is the number of frames processed per second, abbreviated as FPS (frames per second).

Relationship Between Latency and Throughput#

Throughput, as the number of frames coming out of the chip per second, is the inverse of frame-to-frame latency:

throughput = 1 / frame_to_frame_latency

Since frame-to-frame latency is the time it takes from one output frame to the next (in units of seconds), the inverse of this latency gives the number of frames that can be processed in one second, which is the definition of throughput (FPS).
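As a quick numeric check of this relationship:

```python
# A frame-to-frame latency of 50 ms (0.050 s) between consecutive outputs
frame_to_frame_latency = 0.050  # seconds

# Throughput is the inverse of frame-to-frame latency
throughput = 1 / frame_to_frame_latency  # frames per second

print(throughput)  # 20.0 FPS
```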

Sync Systems#

In a synchronous system, the in-to-out latency is equivalent to the frame-to-frame latency, because the input function is blocked until the output function finishes processing the previous frame.

Async Systems#

However, in an asynchronous system, multiple frames may be “in-flight” at once, meaning that the in-to-out latency and frame-to-frame latency are not tied to each other.

For example, if the in-to-out latency is 100 ms but the frame-to-frame latency is 50 ms, then while the first frame is streaming through the pipeline, the second frame can be input after ~50 ms. The pipeline thus has ~2 frames in-flight at a time. The FPS of this system is 20 FPS (1/50 ms), not 10 FPS (1/100 ms): you will receive 20 output frames per second.
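This example can be written out directly, using the 100 ms and 50 ms figures from the text:

```python
import math

in_to_out_latency = 0.100       # 100 ms: input of a frame to its output
frame_to_frame_latency = 0.050  # 50 ms: one output to the next

# Throughput depends only on the frame-to-frame latency
fps = 1 / frame_to_frame_latency  # 20.0, not 10.0

# Approximate number of frames in-flight in the pipeline
in_flight = math.ceil(in_to_out_latency / frame_to_frame_latency)

print(fps, in_flight)  # 20.0 FPS with ~2 frames in-flight
```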

This is why on the Model Explorer page, we show measured FPS and both the in-to-out latency and frame-to-frame latency.

Important

An async pipeline with multiple in-flight frames is not the same as batching. Batching is a different concept where a batch or “bundle” of frames moves through a system as a single unit at the same time.

MemryX hardware always uses a batch size of 1. Each frame is processed independently, as soon as it is put into the pipeline; frames are never collected into batches.
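To make the distinction concrete, here is an illustrative timing sketch. All numbers are hypothetical: a 30 FPS camera delivering a frame every ~33 ms, and 25 ms of processing per frame:

```python
frame_interval = 33  # ms between camera frames (hypothetical)
process_time = 25    # ms of processing per frame (hypothetical)
batch_size = 4

# Batching: frame 0 cannot start until frames 1-3 arrive to fill the batch,
# so its result is delayed by the batch-fill time.
batch_fill = (batch_size - 1) * frame_interval
frame0_latency_batched = batch_fill + process_time  # 99 + 25 = 124 ms

# Batch size 1 with pipelining: frame 0 enters the pipeline immediately.
frame0_latency_pipelined = process_time  # 25 ms

print(frame0_latency_batched, frame0_latency_pipelined)  # 124 25
```

The batch-of-4 system must hold the first frame until the batch fills, inflating its latency, while a batch-1 pipeline starts work on each frame the moment it arrives.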