Runtime#

Module Issues#

The module hangs when the application is started.

When the module begins to hang or idle, we recommend the following troubleshooting steps:

  • Open the dmesg log to see which errors are being printed:

sudo dmesg -w

If you see printed statements similar to the following:

memryx: fops_read: wait timeout 10(s), retrying again

Then the M.2 module is not receiving input data or is not producing output data. Proceed to the next step.

  • Keep the dmesg log open. Use another terminal window to stop and restart your application. If the module continues to hang, check the dmesg log for an error containing the keyword:

fw_ack: (...)

If you see this error, or if the initial timeout error persists, the host system may require a reboot to reset the module’s communication with the host. Proceed to the next step.

  • Reboot the host system. When it starts up, attempt to run the application again. If the code is implemented correctly, the module should no longer hang.

If you are still experiencing the issue, there may be an error in your application code. Check the input callback and output callback functions to make sure the accelerator is properly receiving input data and delivering output data.
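If you need to sanity-check your callback logic in isolation, the sketch below is a pure-Python stand-in (not the MemryX API) for the contract a streaming accelerator expects: the input callback returns one frame per call and None at end-of-stream, while the output callback must drain every result. An input callback that blocks forever, or an output callback that never consumes results, produces exactly the hang described above.

```python
import numpy as np

def run_pipeline(input_callback, output_callback, process):
    """Mock of a streaming accelerator loop: pull, process, push."""
    frames = 0
    while True:
        frame = input_callback()
        if frame is None:            # None signals end-of-stream
            break
        output_callback(process(frame))
        frames += 1
    return frames

# Hypothetical callbacks for a 3-frame stream
stream = iter([np.zeros((224, 224, 3), dtype=np.float32)] * 3)
results = []

def input_callback():
    return next(stream, None)        # must not block indefinitely

def output_callback(out):
    results.append(out)              # must consume every output

n = run_pipeline(input_callback, output_callback, lambda f: float(f.mean()))
print(n, len(results))               # 3 frames in, 3 results out
```

If this loop runs cleanly but the real application still hangs, the problem is likely in how your callbacks source or sink data (camera disconnects, a blocked queue), not in the module itself.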

Performance Issues#

Performance degradation could be caused by multiple issues, including:

Video Decoding/Encoding

If your application reads from video files or RTSP streams, CPU-based decoding of H.264/H.265 into raw frames is a very common bottleneck.

When possible, be sure to use hardware-accelerated video decoding/encoding on your platform to offload this work from the CPU, such as:

  • Intel QuickSync Video (QSV) on Intel CPUs / iGPUs

  • VA-API on AMD iGPUs

  • rkmpp on Rockchip SoCs like Orange Pi
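For example, when piping frames into your application with ffmpeg, decoder selection might look like the sketch below. Flag availability depends on your ffmpeg build, drivers, and platform, and the rkmpp decoder name in particular varies across builds.

```python
def decode_cmd(platform, url):
    """Build an ffmpeg command that decodes `url` to raw BGR frames on stdout."""
    cmd = ["ffmpeg"]
    if platform == "intel":
        cmd += ["-hwaccel", "qsv"]        # Intel QuickSync
    elif platform == "amd":
        cmd += ["-hwaccel", "vaapi"]      # VA-API
    elif platform == "rockchip":
        cmd += ["-c:v", "h264_rkmpp"]     # rkmpp decoder (build-dependent name)
    cmd += ["-i", url, "-f", "rawvideo", "-pix_fmt", "bgr24", "-"]
    return cmd

print(" ".join(decode_cmd("intel", "rtsp://camera/stream")))
```

Run the resulting command under a subprocess and read fixed-size frames from its stdout; if CPU usage drops substantially versus software decode, decoding was your bottleneck.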

Insufficient M.2 Cooling

Each MX3 chip on the M.2 module must be kept below 100 °C; otherwise, it will start to thermally throttle by reducing its frequency by 50%.

Please ensure you have strong enough airflow over the M.2 to avoid overheating.

The temperature and throttle status of each chip can be monitored in a Linux terminal with:

watch cat /sys/memx0/temperature

You can also monitor the temperature programmatically via the Runtime APIs.
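If you would rather poll that sysfs node from code, the sketch below reads it leniently. The node path comes from the command above, but the file's exact output format is an assumption here, so the parser simply takes the first number on each line.

```python
import re

THROTTLE_C = 100.0   # per-chip throttle point

def parse_temps(text):
    """Pull one temperature per line, e.g. '72.5' -> 72.5."""
    temps = []
    for line in text.splitlines():
        m = re.search(r"-?\d+(?:\.\d+)?", line)
        if m:
            temps.append(float(m.group(0)))
    return temps

def chips_near_throttle(text, margin=10.0):
    """Temperatures within `margin` degrees of the throttle point."""
    return [t for t in parse_temps(text) if t >= THROTTLE_C - margin]

# With hardware present you would read the real node instead:
#     text = open("/sys/memx0/temperature").read()
sample = "72.5\n95.0"
print(chips_near_throttle(sample))   # [95.0]
```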

Slow CPU or Pre/Post-Processing

If you have already tried optimizing the model compilation, you may want to check your pre/post-processing code.

Some host CPUs will contribute to decreased performance due to insufficient or slow compute resources. In these cases, performance generally cannot be improved without using an alternate host system or making changes at the application level. If you suspect that your host system is contributing to decreased performance, consider the following:

  • If your model was cropped during compilation, consider implementing custom pre-processing and/or post-processing code optimized in C++ or with Python numpy. Running cropped portions of your model on the host system with the framework runtime (e.g., TensorFlow, ONNX Runtime) may add overhead that can be avoided with custom code.

  • Check your system’s PCIe configuration to ensure that the MemryX module is operating at maximum bandwidth. Refer to the Data Sheet for more information on the PCIe interface requirements. You should have 2 lanes of PCIe Gen3 with MSI-X interrupts for optimal performance.
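As a sketch of the difference vectorization makes, compare an element-wise Python loop against a single vectorized numpy call; the sigmoid here is a stand-in for whatever cropped head you run on the host.

```python
import numpy as np

def sigmoid_loop(x):
    """Slow: one Python-level iteration per feature map value."""
    out = np.empty_like(x)
    flat_in, flat_out = x.ravel(), out.ravel()
    for i in range(flat_in.size):
        flat_out[i] = 1.0 / (1.0 + np.exp(-flat_in[i]))
    return out

def sigmoid_vec(x):
    """Fast: one vectorized call over the whole array."""
    return 1.0 / (1.0 + np.exp(-x))

fmap = np.random.randn(20, 20, 85).astype(np.float32)
assert np.allclose(sigmoid_loop(fmap), sigmoid_vec(fmap), atol=1e-6)
```

On a typical feature map, the vectorized version is orders of magnitude faster than the loop, since the per-element work moves from the Python interpreter into compiled numpy kernels.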

Data Packet Format Issues (Advanced Topic)

In most situations, the MX3 can operate at full speed with most host CPUs. However, for high-FPS models with large input/output resolutions, you may see differences between benchmarks on your system and the quoted Model Explorer benchmarks. Assuming the issue is not thermal, the difference may be due to the packet format.

The MX3 has 3 modes of packetizing data between the host and the accelerator chips:

  1. FP32 (float32)

  2. BF16 (bfloat16)

  3. GBFloat80 (custom MX3 format based on BF16) [output only]

The input (host to chip) and output (chip to host) formats are controllable through the --input_format and --output_format arguments to mx_nc (see NeuralCompiler).

The Neural Compiler uses defaults that give the highest FPS on a moderately capable host CPU (either BF16 or GBFloat80). However, each format has tradeoffs between host CPU usage and PCIe bandwidth utilization.

In order from least to most host CPU overhead: FP32 < BF16 < GBFloat80.

In order from least to most PCIe bandwidth usage: GBFloat80 < BF16 < FP32.
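The bandwidth gap between these formats follows from their encodings: bfloat16 is essentially the upper 16 bits of a float32, so BF16 moves half the bytes of FP32 over PCIe at the cost of a conversion step on the host (GBFloat80 is a proprietary MX3 packing and is not shown). Below is a minimal numpy sketch of the BF16 conversion, using truncation for simplicity where real converters may round-to-nearest.

```python
import numpy as np

def f32_to_bf16_bits(x):
    """Keep the upper 16 bits of each float32 (truncating bfloat16)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """Widen bfloat16 bit patterns back to float32 (low bits zeroed)."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5, 3.14159], dtype=np.float32)
b = f32_to_bf16_bits(x)
print(b.nbytes, "vs", x.nbytes)   # half the bytes on the wire
print(bf16_bits_to_f32(b))        # small precision loss on 3.14159
```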

Single-Stream vs Multi-Stream

In cases with extremely high data transfer rates (e.g., YOLOv7-tiny at 640 resolution), GBFloat80 output conversion may be a bottleneck even on a capable x86 CPU. In multi-stream scenarios, however, each stream (e.g., a camera feed for a VMS) gets its own GBFloat80 decoder. You can therefore check whether you are hitting this bottleneck by comparing:

acclBench -d [model.dfp]

Followed by:

acclBench -d [model.dfp] -n 2

If the total FPS is higher with -n 2, then the single-stream decoder is likely your bottleneck.

If you intend to use this model in a VMS-style application with multiple streams, there is nothing you need to do other than use the Accelerator’s Auto-Threading API. It creates a separate GBFloat80 decoder per stream, which alleviates the issue.

If you are using a single-stream application, you can increase decoding throughput by calling set_parallel_fmap_convert to use 2 or more threads for your model(s). Please see this function’s entry in the C++ API.

Conversion Testing Tool

For those suspecting performance issues related to format conversion, we have an experimental tool for benchmarking your CPU’s format conversion capacity on Linux (ARM and x86-64).

The “val/s” column is the amount of feature map data that your CPU can process per second (M = 10^6). Use this metric to decide which output format to use.

Note

Please be aware of the MX3 limitations section.

For example, suppose your CPU is capable of 9000 Mval/s of FP32 conversion but only 1000 Mval/s of BF16. The MX3 itself is limited to 300 Mval/s of FP32 and 600 Mval/s of BF16. Since the CPU’s rates are well above both limits, the overall system is capped by the MX3’s capacity.

Therefore, you would pick BF16 here to get 600 Mval/s.
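The choice reduces to taking the minimum of the CPU conversion rate and the MX3's cap for each format, then picking the larger result. A quick sketch of that arithmetic, with the MX3 caps from the text and the CPU rates as hypothetical tool output:

```python
# Effective throughput per format = min(CPU conversion rate, MX3 cap).
MX3_CAP = {"FP32": 300, "BF16": 600}       # Mval/s, from the text
cpu     = {"FP32": 9000, "BF16": 1000}     # Mval/s, hypothetical tool output

effective = {fmt: min(cpu[fmt], MX3_CAP[fmt]) for fmt in MX3_CAP}
best = max(effective, key=effective.get)
print(effective, "->", best)   # {'FP32': 300, 'BF16': 600} -> BF16
```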

TL;DR Summary

Generally, the following are the best options:

Good CPU [most x86, high-end ARM]:

  • Input: BF16

  • Output: BF16 or GBFloat80 (let the NeuralCompiler decide)

Weak CPU [Intel Atom, most ARM]:

  • Input: FP32

  • Output: FP32