Runtime#

Module Issues#

The module hangs when the application is started.

When the module begins to hang or idle, we recommend the following troubleshooting steps:

  • Open the dmesg log to see which errors are being printed:

sudo dmesg -w

If you see printed statements similar to the following:

memryx: fops_read: wait timeout 10(s), retrying again

Then the M.2 module is not receiving input data or is not producing output data. Proceed to the next step.

  • Keep the dmesg log open. Use another terminal window to stop and restart your application. If the module continues to hang, check the dmesg log for an error containing the keyword:

fw_ack: (...)

If you see this error, or if the initial timeout error persists, the host system may require a reboot to reset the module’s communication with the host. Proceed to the next step.

  • Reboot the host system. When it starts up, attempt to run the application again. If the code is implemented correctly, the module should no longer hang.

If you are still experiencing the issue, there may be an error in your application code. Check the input callback and output callback functions to make sure the accelerator is properly receiving input data and delivering output data.
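If you need to sanity-check your callback logic in isolation, the sketch below is a pure-Python stand-in (not the MemryX API) for the contract a streaming accelerator expects: the input callback returns one frame per call and None at end-of-stream, while the output callback must drain every result. An input callback that blocks forever, or an output callback that never consumes results, produces exactly the hang described above.

```python
import numpy as np

def run_pipeline(input_callback, output_callback, process):
    """Mock of a streaming accelerator loop: pull, process, push."""
    frames = 0
    while True:
        frame = input_callback()
        if frame is None:            # None signals end-of-stream
            break
        output_callback(process(frame))
        frames += 1
    return frames

# Hypothetical callbacks for a 3-frame stream
stream = iter([np.zeros((224, 224, 3), dtype=np.float32)] * 3)
results = []

def input_callback():
    return next(stream, None)        # must not block indefinitely

def output_callback(out):
    results.append(out)              # must consume every output

n = run_pipeline(input_callback, output_callback, lambda f: float(f.mean()))
print(n, len(results))               # 3 frames in, 3 results out
```

If this loop runs cleanly but the real application still hangs, the problem is likely in how your callbacks source or sink data (camera disconnects, a blocked queue), not in the module itself.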

Performance Issues#

Performance degradation could be caused by multiple issues, including:

Video Decoding/Encoding

If your application reads from video files or RTSP streams, CPU-based decoding of H.264/H.265 into raw frames is a very common bottleneck.

When possible, be sure to use hardware-accelerated video decoding/encoding on your platform to offload this work from the CPU, such as:

  • Intel QuickSync Video (QSV) on Intel CPUs / iGPUs

  • VA-API on AMD iGPUs

  • rkmpp on Rockchip SoCs like Orange Pi
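For example, when piping frames into your application with ffmpeg, decoder selection might look like the sketch below. Flag availability depends on your ffmpeg build, drivers, and platform, and the rkmpp decoder name in particular varies across builds.

```python
def decode_cmd(platform, url):
    """Build an ffmpeg command that decodes `url` to raw BGR frames on stdout."""
    cmd = ["ffmpeg"]
    if platform == "intel":
        cmd += ["-hwaccel", "qsv"]        # Intel QuickSync
    elif platform == "amd":
        cmd += ["-hwaccel", "vaapi"]      # VA-API
    elif platform == "rockchip":
        cmd += ["-c:v", "h264_rkmpp"]     # rkmpp decoder (build-dependent name)
    cmd += ["-i", url, "-f", "rawvideo", "-pix_fmt", "bgr24", "-"]
    return cmd

print(" ".join(decode_cmd("intel", "rtsp://camera/stream")))
```

Run the resulting command under a subprocess and read fixed-size frames from its stdout; if CPU usage drops substantially versus software decode, decoding was your bottleneck.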

Insufficient M.2 Cooling

Each MX3 chip on the M.2 module must be kept below 100 °C; otherwise, it will start to thermally throttle by reducing its frequency by 50%.

Please ensure you have strong enough airflow over the M.2 to avoid overheating.

The temperature and throttle status of each chip can be monitored in a Linux terminal with:

watch cat /sys/memx0/temperature

You can also monitor the temperature programmatically via the Runtime APIs.
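If you would rather poll that sysfs node from code, the sketch below reads it leniently. The node path comes from the command above, but the file's exact output format is an assumption here, so the parser simply takes the first number on each line.

```python
import re

THROTTLE_C = 100.0   # per-chip throttle point

def parse_temps(text):
    """Pull one temperature per line, e.g. '72.5' -> 72.5."""
    temps = []
    for line in text.splitlines():
        m = re.search(r"-?\d+(?:\.\d+)?", line)
        if m:
            temps.append(float(m.group(0)))
    return temps

def chips_near_throttle(text, margin=10.0):
    """Temperatures within `margin` degrees of the throttle point."""
    return [t for t in parse_temps(text) if t >= THROTTLE_C - margin]

# With hardware present you would read the real node instead:
#     text = open("/sys/memx0/temperature").read()
sample = "72.5\n95.0"
print(chips_near_throttle(sample))   # [95.0]
```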

Slow CPU or Pre/Post-Processing

If you have already tried optimizing the model compilation, you may want to check your pre/post-processing code.

Some host CPUs will contribute to decreased performance due to insufficient or slow compute resources. In these cases, performance generally cannot be improved without using an alternate host system or making changes at the application level. If you suspect that your host system is contributing to decreased performance, consider the following:

  • If your model was cropped during compilation, consider implementing custom pre-processing and/or post-processing code optimized in C++ or with Python numpy. Running cropped portions of your model on the host system with the framework runtime (e.g., TensorFlow, ONNX Runtime) may add overhead that can be avoided with custom code.

  • Check your system’s PCIe configuration to ensure that the MemryX module is operating at maximum bandwidth. Refer to the Data Sheet for more information on the PCIe interface requirements. You should have 2 lanes of PCIe Gen3 with MSI-X interrupts for optimal performance.
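As a sketch of the difference vectorization makes, compare an element-wise Python loop against a single vectorized numpy call; the sigmoid here is a stand-in for whatever cropped head you run on the host.

```python
import numpy as np

def sigmoid_loop(x):
    """Slow: one Python-level iteration per feature map value."""
    out = np.empty_like(x)
    flat_in, flat_out = x.ravel(), out.ravel()
    for i in range(flat_in.size):
        flat_out[i] = 1.0 / (1.0 + np.exp(-flat_in[i]))
    return out

def sigmoid_vec(x):
    """Fast: one vectorized call over the whole array."""
    return 1.0 / (1.0 + np.exp(-x))

fmap = np.random.randn(20, 20, 85).astype(np.float32)
assert np.allclose(sigmoid_loop(fmap), sigmoid_vec(fmap), atol=1e-6)
```

On a typical feature map, the vectorized version is orders of magnitude faster than the loop, since the per-element work moves from the Python interpreter into compiled numpy kernels.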

Data Packet Format Issues (Advanced Topic)

In most situations, the MX3 can operate at full speed with most host CPUs. However, for high-FPS models with large input/output resolutions, you may see differences between benchmarks on your system and the quoted Model Explorer benchmarks. Assuming the issue is not thermal, the difference may be due to the packet format.

The MX3 has 3 modes of packetizing data between the host and the accelerator chips:

  1. FP32 (float32)

  2. BF16 (bfloat16)

  3. GBFloat80 (custom MX3 format based on BF16) [output only]

The input (host to chip) and output (chip to host) formats are controllable through the --input_format and --output_format arguments to mx_nc (see NeuralCompiler).

The Neural Compiler uses defaults that give the highest FPS on a moderately capable host CPU (either BF16 or GBFloat80). However, each format has tradeoffs between host CPU usage and PCIe bandwidth utilization.

In order from least to most host CPU overhead: FP32 < BF16 < GBFloat80.

In order from least to most PCIe bandwidth usage: GBFloat80 < BF16 < FP32.
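The bandwidth gap between these formats follows from their encodings: bfloat16 is essentially the upper 16 bits of a float32, so BF16 moves half the bytes of FP32 over PCIe at the cost of a conversion step on the host (GBFloat80 is a proprietary MX3 packing and is not shown). Below is a minimal numpy sketch of the BF16 conversion, using truncation for simplicity where real converters may round-to-nearest.

```python
import numpy as np

def f32_to_bf16_bits(x):
    """Keep the upper 16 bits of each float32 (truncating bfloat16)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """Widen bfloat16 bit patterns back to float32 (low bits zeroed)."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5, 3.14159], dtype=np.float32)
b = f32_to_bf16_bits(x)
print(b.nbytes, "vs", x.nbytes)   # half the bytes on the wire
print(bf16_bits_to_f32(b))        # small precision loss on 3.14159
```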

Single-Stream vs Multi-Stream

In cases with extremely high data transfer rates (e.g., YOLOv7-tiny at 640 resolution), GBFloat80 output conversion may be a bottleneck even on a capable x86 CPU. In multi-stream scenarios, however, each stream (e.g., a camera feed for a VMS) gets its own GBFloat80 decoder. You can therefore check whether you are hitting this bottleneck by comparing:

acclBench -d [model.dfp]

Followed by:

acclBench -d [model.dfp] -n 2

If the total FPS is higher with -n 2, then the single-stream decoder is likely your bottleneck.

If you intend to use this model in a VMS-style application with multiple streams, there is nothing you need to do other than use the Accelerator’s Auto-Threading API. It creates a separate GBFloat80 decoder per stream, which alleviates the issue.

If you are using a single-stream application, you can increase decoding throughput by calling set_parallel_fmap_convert to use 2 or more threads for your model(s). Please see this function’s entry in the C++ API.

Conversion Testing Tool

For those suspecting performance issues related to format conversion, we have an experimental tool for benchmarking your CPU’s format conversion capacity on Linux (ARM and x86-64).

The “val/s” column is the amount of feature map data that your CPU can process per second (M = 10^6). Use this metric to decide which output format to use.

Note

Please be aware of the MX3 limitations section.

For example, suppose your CPU is capable of 9000 Mval/s of FP32 conversion but only 1000 Mval/s of BF16. The MX3 itself is limited to 300 Mval/s of FP32 and 600 Mval/s of BF16. Since the CPU’s rates are well above both limits, the overall system is capped by the MX3’s capacity.

Therefore, you would pick BF16 here to get 600 Mval/s.
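The choice reduces to taking the minimum of the CPU conversion rate and the MX3's cap for each format, then picking the larger result. A quick sketch of that arithmetic, with the MX3 caps from the text and the CPU rates as hypothetical tool output:

```python
# Effective throughput per format = min(CPU conversion rate, MX3 cap).
MX3_CAP = {"FP32": 300, "BF16": 600}       # Mval/s, from the text
cpu     = {"FP32": 9000, "BF16": 1000}     # Mval/s, hypothetical tool output

effective = {fmt: min(cpu[fmt], MX3_CAP[fmt]) for fmt in MX3_CAP}
best = max(effective, key=effective.get)
print(effective, "->", best)   # {'FP32': 300, 'BF16': 600} -> BF16
```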

TL;DR Summary

Generally, the following are the best options:

Good CPU [most x86, high-end ARM]:

  • Input: BF16

  • Output: BF16 or GBFloat80 (let the NeuralCompiler decide)

Weak CPU [Intel Atom, most ARM]:

  • Input: FP32

  • Output: FP32