Optional Model Optimization for MXA
At MemryX, we pride ourselves on supporting efficient inferencing of your NN models with no required modifications. Other inferencing accelerators may rely on model tuning tricks to increase support or performance, such as fine-tuning, retraining, quantization, pruning, or layer conversions. We strive to deliver an out-of-the-box experience, compiling your model as-is, in order to minimize your development and deployment times.
However, like any hardware, there are certain properties of models that run more efficiently on the MemryX MXA. For experienced users who would like to squeeze as much inferencing performance out of our accelerator as possible, the following is a list of optional recommendations that may lead to more efficient use of the accelerator hardware. This can manifest as higher inferencing performance (↑FPS, ↓latency), increased energy efficiency, and/or a model that maps to fewer chips.
Tip: Use ReLU / ReLU-6 / ReLU-N activations if possible
The NeuralCompiler performs a degree of layer fusion, allowing many layers to map to a single PE. When this happens, both compute load and feature-map storage/bandwidth are reduced. Simple activations such as relu and relu6 will likely fuse with adjacent layers, whereas advanced activations like sigmoid, swish, and tanh are unable to fuse.
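As an illustration, the Keras sketch below builds the same convolution block with either a fusible ReLU-6 activation or a non-fusible swish activation. The layer choices and shapes are arbitrary examples, not taken from any reference model.

```python
import tensorflow as tf

def conv_block(x, filters, fusible=True):
    # Plain 3x3 convolution; the activation choice is what matters here.
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    if fusible:
        # ReLU-6: a simple activation that will likely fuse with the Conv2D above.
        return tf.keras.layers.ReLU(max_value=6.0)(x)
    # swish: an advanced activation that cannot be fused.
    return tf.keras.layers.Activation("swish")(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = conv_block(inputs, 64, fusible=True)
model = tf.keras.Model(inputs, outputs)
```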
Tip: Certain dimensions are more ‘friendly’ to the MXA
Higher utilization of the PEs can be achieved if dimensions are kept to round numbers (a small sketch follows this list).
Keep feature map channels a multiple of 8
Convolutions with filter counts that are a multiple of 64 will be the most efficient
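For example, the Keras snippet below picks channel and filter counts that follow these guidelines; the specific values (64, 128) are illustrative choices only.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(112, 112, 3))
# Filter counts chosen as multiples of 64, so downstream feature-map
# channels are also multiples of 8 (e.g. prefer 64 over 61, 128 over 120).
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
model = tf.keras.Model(inputs, x)
```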
Tip: Some layers should be avoided if possible
Some layers (while supported) can incur expensive decompositions on our hardware and should be avoided for maximum efficiency. The sketch after this list shows what each pattern looks like in a model definition.
Reshapes that include the channel dimension
Transposes
Group-Convolutions
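The following Keras sketch shows examples of each pattern, so they are easier to spot when reviewing a model; the shapes and group counts are arbitrary examples.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(56, 56, 32))

# Reshape that touches the channel dimension (32 -> 8 x 4).
r = tf.keras.layers.Reshape((56, 56, 8, 4))(inputs)

# Transpose / permutation of feature-map axes.
t = tf.keras.layers.Permute((3, 1, 2))(inputs)

# Group convolution (groups > 1).
g = tf.keras.layers.Conv2D(64, 3, padding="same", groups=4)(inputs)
```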
Tip: Many small layers are often less efficient than a few larger layers
The NeuralCompiler spatially (and statically) maps the computation graph onto our PEs, efficiently splitting large workloads (single layers) over many PEs. When there are many small layers instead, the mapping can become more difficult. Additionally, a model with more layers naturally consumes more intermediate feature-map memory and bandwidth.
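As a structural sketch (not a numerically equivalent rewrite), compare a stack of many narrow convolutions with a shorter stack of wider ones. The layer counts and widths below are arbitrary; the point is that the wider variant gives the compiler larger single-layer workloads to split across PEs and produces fewer intermediate feature maps.

```python
import tensorflow as tf

def many_small(x):
    # Eight narrow convolutions: harder to map, more intermediate feature maps.
    for _ in range(8):
        x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    return x

def few_large(x):
    # Two wider convolutions: larger per-layer workloads, fewer intermediates.
    for _ in range(2):
        x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    return x

inputs = tf.keras.Input(shape=(56, 56, 16))
model = tf.keras.Model(inputs, few_large(inputs))
```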
Note
We will continue to expand this page with more tips and details.