Performance#

Possible Causes of Low Performance#

Performance degradation could be caused by multiple issues, including:

  • Insufficient M.2 cooling

  • Data packet format issues (Advanced Topic)

    • Slow host CPU

    • Single-stream decode capability

Insufficient Cooling#

Each MX3 chip on the M.2 module must be kept below 100°C; otherwise it will begin to thermal throttle by cutting its clock frequency in half.

Please ensure you have strong enough airflow over the M.2 to avoid overheating.

The temperature and throttle status of each chip can be monitored in a Linux terminal with:

watch cat /sys/memx0/temperature
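By default, watch refreshes every 2 seconds. While tuning airflow, you may want to poll more frequently using the standard -n flag:

watch -n 1 cat /sys/memx0/temperature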

Data Packet Format#

Advanced Topic!

In most situations, the MX3 can operate at full speed with most host CPUs. However, for high-FPS models with large input/output resolutions, you may see differences between benchmarks on your system and the quoted Model Explorer benchmarks. Assuming the issue is not thermal-related, the difference may be due to the packet format.

The MX3 has 3 modes of packetizing data between the host and the accelerator chips:

  1. FP32 (float32)

  2. BF16 (bfloat16)

  3. GBFloat80 (custom MX3 format based on BF16) [output only]

The input (host to chip) and output (chip to host) formats are controllable through the --input_format and --output_format arguments to mx_nc (see NeuralCompiler).

The Neural Compiler uses defaults that give the highest FPS on a moderately capable host CPU (either BF16 or GBFloat80). However, each format has tradeoffs between host CPU usage and PCIe bandwidth utilization.

In order from least to most host CPU overhead, the formats are: FP32 -> BF16 -> GBFloat80.

Conversely, in order from least to most PCIe bandwidth usage, they are: GBFloat80 -> BF16 -> FP32.

Slow Host CPU#

If you are using a low-end ARM CPU, you may find that the FP32 packet format yields better FPS than the defaults. When compiling a model for such a system, try passing --input_format FP32 --output_format FP32 to the Neural Compiler and compare the FPS against the defaults (e.g., with the C++ Benchmark).
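For example, a compile command might look like the following (a hypothetical invocation; the -m model flag and filename are placeholders, so adjust them for your model and SDK version):

mx_nc -m model.onnx --input_format FP32 --output_format FP32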

Single-Stream Decode#

In cases with extremely high data transfer rates (e.g., YOLOv7-tiny at 640 resolution), GBFloat80 output conversion may be a bottleneck even on a capable x86 CPU. In multi-stream scenarios, however, each stream (e.g., a camera feed for a VMS) gets its own GBFloat80 decoder, so adding a second stream raises total FPS when decode is the bottleneck. You can therefore verify whether you are hitting this situation with:

acclBench -d [model.dfp]

Followed by:

acclBench -d [model.dfp] -n 2

If the total FPS is higher with -n 2, then you have likely hit this situation.

If you intend to use this model in a VMS-style application with multiple streams, there is nothing you need to do other than use the Accelerator’s Auto-Threading API, which provides a separate GBFloat80 decoder per stream.

If you are using a single-stream application, you can increase decoding throughput by calling set_parallel_fmap_convert to use 2 or more threads for your model(s). Please see this function’s entry in the C++ API.
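A minimal sketch of the call is shown below. The set_parallel_fmap_convert function is the one documented in the C++ API; the surrounding setup (header path, MxAccl class, connect_dfp/start/stop) follows the usual MxAccl pattern but is an assumption here, and the argument order (thread count, then model index) may differ in your SDK version, so consult the API reference.

#include "memx/accl/MxAccl.h"

int main() {
    // Load the compiled model onto the accelerator (setup assumed;
    // see the C++ API reference for the exact class usage).
    MX::Runtime::MxAccl accl;
    accl.connect_dfp("model.dfp");

    // Hypothetical usage: convert output feature maps for model 0
    // with 2 worker threads instead of the default single thread.
    accl.set_parallel_fmap_convert(2, 0);

    // ... connect input/output streams, then run inference ...
    accl.start();
    // ...
    accl.stop();
    return 0;
}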

Conversion Testing Tool#

We have an experimental tool for benchmarking your CPU’s format conversion capacity on Linux (ARM and x86-64).

The “val/s” column reports how many feature map values your CPU can convert per second (M = 10^6). Use this metric to decide which output format to use.

Note

Please be aware of the MX3 limitations section.

For example, suppose your CPU is capable of 9000 Mval/s of FP32 and 1000 Mval/s of BF16. The MX3 itself is limited to 300 Mval/s of FP32 and 600 Mval/s of BF16. Since your CPU in this case is capable of rates well above both limits, the overall system is limited by the MX3’s capacity: the effective throughput of each format is the minimum of the CPU and MX3 rates.

Therefore, you would pick BF16 here to get 600 Mval/s instead of 300 Mval/s.
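A quick way to reason about this is to take the minimum of the CPU and MX3 rates for each format. A minimal sketch using the example numbers above (the CPU rates are the hypothetical ones from the example):

#include <algorithm>
#include <cstdio>

int main() {
    // Rates in Mval/s: CPU numbers are from the example above,
    // MX3 limits are the fixed 300 (FP32) / 600 (BF16) capacities.
    const double cpu_fp32 = 9000.0, cpu_bf16 = 1000.0;
    const double mx3_fp32 = 300.0,  mx3_bf16 = 600.0;

    // Effective throughput is bounded by the slower side.
    const double eff_fp32 = std::min(cpu_fp32, mx3_fp32); // 300 Mval/s
    const double eff_bf16 = std::min(cpu_bf16, mx3_bf16); // 600 Mval/s

    std::printf("FP32: %.0f Mval/s\nBF16: %.0f Mval/s\n", eff_fp32, eff_bf16);
    // BF16 wins (600 > 300), even though the CPU converts FP32 faster.
    return 0;
}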

Generally, the following are the best options:

Good CPU [most x86, high-end ARM]:

  • Input: BF16

  • Output: BF16 (small feature maps), or GBFloat80 (>=16 channel feature maps)

Weak CPU [Intel Atom, most ARM]:

  • Input: FP32

  • Output: FP32

But the best answer is to run the tool on your specific system and choose the format with the highest effective Mval/s.