Optional Model Optimization for MXA#

At MemryX, we pride ourselves on supporting efficient inferencing of your NN models with no required modifications. Other inferencing accelerators may rely on model-tuning tricks, such as fine-tuning, retraining, quantization, pruning, or layer conversions, to increase support or performance. We strive to provide an out-of-the-box experience, compiling your model as-is, in order to minimize your development and deployment time.

However, as with any hardware, certain model properties map more efficiently onto the MemryX MXA. For experienced users who would like to squeeze as much inferencing performance out of our accelerator as possible, the following is a list of ‘optional recommendations’ that may lead to more efficient use of the accelerator hardware. The benefit may manifest as higher inferencing performance (↑FPS, ↓latency), increased energy efficiency, and/or a model that requires fewer chips to map.

Tip: Use ReLU / ReLU-6 / ReLU-N activations if possible

The NeuralCompiler performs a degree of layer fusion, allowing multiple layers to map onto a single PE. When this happens, both compute load and feature-map storage/bandwidth are reduced. Simple activations such as ReLU and ReLU-6 will likely fuse with their adjacent layers, whereas advanced activations such as sigmoid, swish, and tanh cannot be fused.
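
A minimal Keras sketch of this choice is shown below (assuming a TensorFlow/Keras workflow; the layer sizes and input shape are illustrative). The fusable variant uses ReLU-6, which the compiler will likely fuse into the preceding convolution, while the swish variant remains a separate layer.

    import tensorflow as tf

    def conv_block(x, filters, fusable=True):
        """3x3 convolution followed by an activation."""
        x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
        if fusable:
            # ReLU-6 will likely fuse with the convolution above.
            x = tf.keras.layers.ReLU(max_value=6.0)(x)
        else:
            # swish maps as its own layer and cannot fuse.
            x = tf.keras.layers.Activation("swish")(x)
        return x

    inputs = tf.keras.Input(shape=(224, 224, 3))
    outputs = conv_block(inputs, filters=64, fusable=True)
    model = tf.keras.Model(inputs, outputs)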

Tip: Certain dimensions are more ‘friendly’ to the MXA

Higher utilization of the PEs can be achieved when tensor dimensions are kept to hardware-friendly multiples, as illustrated in the sketch after this list.

  • Keep feature map channels a multiple of 8

  • Convolutions whose filter count is a multiple of 64 will be the most efficient
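
For concreteness, here is a hedged Keras sketch of a dimension-friendly block (assuming a TensorFlow/Keras workflow; the specific sizes are illustrative only):

    import tensorflow as tf

    # Filter counts chosen as multiples of 64, so every intermediate
    # feature map also has a channel count that is a multiple of 8.
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same")(inputs)   # 64 filters
    x = tf.keras.layers.ReLU(max_value=6.0)(x)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same")(x)       # 128 filters
    x = tf.keras.layers.ReLU(max_value=6.0)(x)
    model = tf.keras.Model(inputs, x)

    # Something like Conv2D(100, 3) would still compile, but 100 is neither
    # a multiple of 8 nor of 64, so PE utilization would likely be lower.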

Tip: Some layers should be avoided if possible.

Some layers, while supported, can incur expensive decompositions on our hardware and should be avoided for maximum efficiency. A sketch after this list shows one way to scan a model for them.

  • Reshapes that include the channel dimension

  • Transposes

  • Group-Convolutions
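
One way to check for these layers before compiling is to scan the exported graph. Below is a minimal sketch using the onnx Python package; the file name model.onnx is a placeholder for your own export, and the check is only a heuristic (for example, it flags every Reshape, not just those that include the channel dimension).

    import onnx

    # Placeholder path; point this at your own exported model.
    model = onnx.load("model.onnx")

    # Op types that, while supported, may decompose expensively on the MXA.
    costly_ops = {"Reshape", "Transpose"}

    for node in model.graph.node:
        if node.op_type in costly_ops:
            print(f"Found {node.op_type}: {node.name}")
        # Group convolutions appear as Conv nodes with a 'group' attribute > 1.
        if node.op_type == "Conv":
            for attr in node.attribute:
                if attr.name == "group" and attr.i > 1:
                    print(f"Found group convolution: {node.name} (groups={attr.i})")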

Tip: Many small layers are often less efficient than a few larger layers.

The NeuralCompiler spatially (and statically) maps the computation graph onto our PEs, efficiently splitting large workloads (single layers) across many PEs. When a model instead consists of many small layers, the mapping becomes more difficult. Additionally, a model with more layers naturally consumes more intermediate feature-map memory and bandwidth.
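
To make the trade-off concrete, the illustrative Keras sketch below contrasts a block of many narrow convolutions with a block of fewer, wider ones. The two blocks are not numerically equivalent; the sketch only shows the structural difference the compiler has to map.

    import tensorflow as tf

    def many_small_layers(x):
        # Eight narrow convolutions: more layers to map and more
        # intermediate feature maps to store and move around.
        for _ in range(8):
            x = tf.keras.layers.Conv2D(16, 3, padding="same")(x)
            x = tf.keras.layers.ReLU(max_value=6.0)(x)
        return x

    def fewer_larger_layers(x):
        # Two wide convolutions: fewer, larger workloads that can be
        # split efficiently across many PEs.
        for _ in range(2):
            x = tf.keras.layers.Conv2D(64, 3, padding="same")(x)
            x = tf.keras.layers.ReLU(max_value=6.0)(x)
        return x

    inputs = tf.keras.Input(shape=(224, 224, 3))
    model = tf.keras.Model(inputs, fewer_larger_layers(inputs))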

Note

We will continue to expand this page with more tricks and details.