Neural Compiler & Models#
Compiler Errors#
Resource Errors
Resource errors are identified by the prefix memryx.errors.ResourceError:, followed by a detailed message. This indicates insufficient resources on the target chip/module for compiling your neural network models.
Troubleshooting Steps:
- Confirm you are targeting the correct number of chips on your module. For the M.2 module, the chip count is 4, which can be specified with the -c 4 argument. For further details on using the -c argument, refer to the Neural Compiler documentation.
- Consider using the --autocrop argument to attempt running your model. Cropping pre-processing and/or post-processing layers may enable better mapping to the accelerator. More information on the --autocrop argument can be found in the Neural Compiler documentation.
Hint
Pre-processing and post-processing layers are much lighter in computational demand compared to the core of the model and can often be cropped without adversely affecting performance. These layers, being less resource-intensive, can be efficiently executed on the host through the Accelerator API.
- Explore the possibility of utilizing more chips with the -c num_chips argument, but proceed with caution. Although the simulator can support any number of chips, it’s crucial to adhere to the hardware’s chip capacity. This method can provide insights into the necessary chip count for your model. Alternatively, compiling with -c min allows the compiler to determine the minimum required number of chips. (A Python compile sketch follows these steps.)
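For reference, the chip-count and autocrop options can also be exercised from Python. The following is a minimal sketch assuming the NeuralCompiler Python API shown in the MemryX tutorials; the models, num_chips, and autocrop keyword names are assumptions that mirror the CLI arguments above, so verify them against the Neural Compiler documentation:

    from memryx import NeuralCompiler

    # Target the 4-chip M.2 module (CLI equivalent: -c 4) and let the
    # compiler crop pre-/post-processing layers (CLI equivalent: --autocrop).
    nc = NeuralCompiler(models="model.onnx", num_chips=4, autocrop=True, verbose=1)
    dfp = nc.run()  # fails with a memryx.errors.ResourceError if the model still does not fit

On the command line, -c min similarly lets the compiler search for the minimum required chip count.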
Unsupported Operator Errors
Unsupported operator errors are identified by the prefix memryx.errors.OperatorError:, followed by a detailed message. This indicates that the Neural Compiler does not support one or more operators present in your neural network model.
Troubleshooting Steps:
- Review the detailed error message to identify the specific unsupported operator(s). When driving the compiler from Python, the error can also be caught programmatically (see the sketch after these steps).
- Consult the Supported Operators documentation to verify if the operator is indeed unsupported.
- If the operator is unsupported, consider modifying your model to replace it with a supported alternative. This may involve using different layers or operations that achieve similar functionality.
- If you believe the operator should be supported or if you require assistance, please contact MemryX support for further guidance.
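This is a minimal sketch of catching the failure from Python, assuming that the errors under memryx.errors are raised as Python exceptions (as the memryx.errors.OperatorError: prefix suggests) and that the NeuralCompiler API matches the MemryX tutorials:

    from memryx import NeuralCompiler
    from memryx.errors import OperatorError

    try:
        dfp = NeuralCompiler(models="model.onnx").run()
    except OperatorError as err:
        # The message names the unsupported operator(s); check them against
        # the Supported Operators list before modifying the model.
        print(err)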
Model Accuracy#
How to Address Accuracy Degradation
Accuracy degradation could be caused by multiple issues, including:
- A bug
- Weights precision
- Output channels precision
- Other issues:
  - Using a pre-quantized model
  - Ignoring an “approximated operator” warning
  - etc.
We suggest the following troubleshooting steps:
1. Check the compiler warnings
To sort out operator approximation issues
2. Use the correct input and output data shapes
MemryX accelerators use the channel-last (NHWC) data format (a conversion sketch follows these steps)
3. Compile using unquantized models
To avoid double quantization errors
4. Check if your operators are supported
Check the supported operators list: Supported Operators
5. Use identical testing flow and post-processing (CPU vs. accelerator)
Use the same pre- and post-processing for both flows
6. Use an accuracy metric rather than visual inspection whenever possible
If you train your models, use the accuracy metric you used while training.
If you use an off-the-shelf model, use standard accuracy metrics.
7. Run one model at a time on the accelerator (in the case of multi-model)
To figure out which model is introducing the error
The rest of the models should run on the host for the sake of testing.
8. Try double-precision weights
To sort out a weights-precision issue
Should be applied to unquantized models
Please check: Mixed-Precision Weights
If double precision is helping, you can use our new experimental auto-double-precision feature by compiling your model with the following flag:
--exp_auto_dp
9. Try High Precision Output Channels (HPOC)
Can be used to enhance the precision of critical output channels (e.g., bounding box coordinates)
Especially useful when output channels contain multiple types of data with varying dynamic ranges and/or different post-processing (e.g., bounding box coordinates and classification scores)
Please check: High Precision Output Channels
10. Crop your model
Will help to check if a given section of the model is causing the accuracy degradation
You can use the Neural Compiler manual cropping feature: Model Cropping
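Regarding step 2: MemryX accelerators expect channel-last (NHWC) tensors, while many PyTorch and ONNX pipelines produce channel-first (NCHW) arrays. A minimal numpy sketch of the conversion (the array names and shape are illustrative):

    import numpy as np

    # Hypothetical channel-first frame, e.g. from a PyTorch/ONNX pipeline: (1, 3, 224, 224)
    x_nchw = np.random.rand(1, 3, 224, 224).astype(np.float32)

    # Reorder to channel-last before feeding the accelerator: (1, 224, 224, 3)
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

Apply the inverse permutation, (0, 3, 1, 2), to accelerator outputs if your post-processing expects channel-first data.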
Model Performance#
How to Address Sub-Optimal Performance of Compiled Models
If a locally compiled model benchmarks with lower performance than expected, there are two potential causes:
The model is not optimally mapped to our architecture during compilation.
When compiling a model, the Neural Compiler has a flag called --effort which determines how many mapping points the compiler will generate to create the final DFP. By default, the compiler runs with --effort normal, meaning the first valid mapping point for a given model will be used to create the DFP, even if that mapping point is not the optimal choice for performance.
If you wish to increase the performance for a given model, consider compiling with the --effort hard flag, which will generate multiple mapping points and choose the optimal one based on a cost function. Further performance gains can be obtained by using our closed-loop script, which will perform on-chip benchmarking for each generated mapping point.
Note that compiling with --effort hard and the closed-loop script will increase compilation time significantly, but it will typically result in improved performance.
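For example, here is a minimal sketch assuming the NeuralCompiler Python API from the MemryX tutorials; the effort keyword is an assumption that mirrors the --effort CLI flag, so verify it against the Neural Compiler documentation:

    from memryx import NeuralCompiler

    # "normal" (the default) stops at the first valid mapping point;
    # "hard" generates multiple mapping points and picks the best by cost function.
    nc = NeuralCompiler(models="model.onnx", effort="hard", verbose=1)
    dfp = nc.run()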
The host system contributes to the decreased performance.
Some host CPUs will contribute to decreased performance due to insufficient or slow compute resources. In these cases, performance generally cannot be improved without using an alternate host system or making changes at the application level. If you suspect that your host system is contributing to decreased performance, consider the following:
If your model was cropped during compilation, consider implementing custom pre-processing and/or post-processing code that is optimized with C++ or Python numpy. Running cropped portions of your model on the host system with the framework runtime (e.g., TensorFlow, ONNX Runtime) may add additional overhead that can be avoided with custom code.
Check your system’s PCIe configuration to ensure that the MemryX module is operating at maximum bandwidth. Refer to the Data Sheet for more information on the PCIe interface requirements. You should have 2 lanes of PCIe Gen3 with MSI-X interrupts for optimal performance.
If you are using a low-end ARM CPU, you may find that using FP32 packet format yields better FPS than the defaults. When compiling a model for such a system, you may want to try passing --input_format FP32 and --output_format FP32 to the Neural Compiler, and compare the FPS vs. the defaults using a benchmark tool (like acclBench).
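As a sketch, the invocation below assumes the compiler CLI entry point is mx_nc and that -m selects the model file; only the --input_format and --output_format flags are taken from the paragraph above, so check the names against your SDK installation:

    import subprocess

    # Compile with FP32 packet formats on both the input and output side.
    # "mx_nc" and "-m" are assumed names; --input_format/--output_format are
    # the flags described above.
    subprocess.run(["mx_nc", "-m", "model.onnx",
                    "--input_format", "FP32",
                    "--output_format", "FP32"],
                   check=True)

Benchmark the resulting DFP with a tool such as acclBench and compare its FPS against a DFP compiled with the default packet formats.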
If you are willing to use an alternate model, the following factors can yield significant improvements in performance without much loss in accuracy:
Consider using a model with a more MXA-friendly architecture.
Use a smaller model input resolution.
Models with smaller input resolutions can deliver substantially higher throughput.
From our Model Explorer page, observe the performance difference between two variants of YOLOv7-tiny with different input resolutions:
YOLOv7-tiny, 640x640x3: 269 FPS
YOLOv7-tiny, 320x320x3: 1162 FPS
Consider tweaking the model architecture.
YOLOv11’s architecture contains a new type of block (Partial Self-Attention, PSA) that allows it to have fewer weights and run faster on a GPU, great for getting headlines and publishing papers. However, by simply adding more weights instead of the PSA block, one can achieve the same accuracy but with much better MXA performance. This is what the ‘MXA Optimized’ variants of YOLOv11 in the Model Explorer are doing. Contact MemryX if you would like us to assist you with deep model optimization.