FAQ#

Install#

How do I install the SDK on Windows?

If you are on Windows, you can use WSL to install Ubuntu Linux as an app on your computer. Once installed, use it to follow the instructions in the Install section.

Note that if you haven’t used WSL before, you will need to install some basic packages first:

$ sudo apt update
$ sudo apt install python3.9-venv python3-pip

Please note that WSL enables the use of our Neural Compiler, Simulator, and DFP Inspect tools only. To use our runtime packages, you must install natively on Windows. As a reminder, we support Python versions 3.9, 3.10, 3.11, and 3.12. Any of these versions can be used for the package installation step above.

To use graphical tools (e.g., mx_viewer), also install:

$ sudo apt install qt5dxcb-plugin python3-tk

How do I install the SDK on Mac?

If you are on macOS, it’s a little more complicated. You’ll need to either use Lima (lima-vm) or, if on an Intel Mac, set up a VM using VirtualBox, VMware, or Parallels.

Then log in to Linux and follow the Install steps.

Why is it recommended to use a virtual environment?

For the installation of our Python tools and API, we recommend the use of a virtual environment to avoid any conflicting dependencies or pre-installed packages on your host system. The use of Python virtual environments is generally good practice because it allows a user to maintain separation between environments with different version requirements, dependencies, etc.

Ultimately, the use of virtual environments is optional but recommended for the reasons stated above.

Hardware#

Does the M.2 module have DRAM?

No, the M.2 module does not have DRAM. Rather, it utilizes what we refer to as distributed, on-chip memory using SRAM. Please refer to our Architecture Whitepaper for a more detailed description of our on-chip memory.

Why isn’t the M.2 module being recognized after installation?

There are a few reasons the M.2 module may not be recognized by your system after installation. The troubleshooting page linked here should provide some insights and advice for getting past this issue.

Which PCIe Gens are supported, and why?

Our M.2 module is compatible with PCIe Gen 3 and above. The module requires MSI-X (Message Signaled Interrupts - Extended) to function properly. MSI-X supports a large number of interrupt vectors—typically up to 2048—which is critical for handling the high parallelism and real-time responsiveness needed in AI workloads.

In contrast, PCIe Gen 2 systems do not support MSI-X, and only some support the older MSI standard (which is limited to 32 interrupt vectors).

Starting with PCIe Gen 3, support for MSI-X became mandatory, ensuring the level of interrupt handling we rely on.

For this reason, PCIe Gen 3 or newer is required to ensure full compatibility and performance.

Note

The Raspberry Pi 5 has a PCIe Gen 3 compliant controller, and thus has the required MSI-X support. It just runs at Gen 2 speed by default, which can be increased to Gen 3 speed with sudo mx_arm_setup.

Performance#

Why is the performance on an ARM board worse than on an x86 system?

ARM boards are designed for lower overall power consumption, so we set the default clock frequency of our module to 500 MHz on these architectures. Comparatively, the default frequency for x86 systems is 600 MHz.

The difference in clock frequency is the primary cause of observable performance differences between ARM and x86 systems, but other factors, such as fewer compute resources for video decoding, pre-processing, and post-processing, can also reduce end-to-end performance on ARM architectures.

Performance differences always depend on the model being run and may sometimes be addressed by altering output formats. Please refer to our Troubleshooting Page for further guidance on diagnosing performance issues.

Can PCIe bandwidth be a limiting factor for performance?

Whether PCIe bandwidth will be a limiting factor for performance is highly situational and dependent on the following factors:

  • the model being run (FPS and input/output resolution)

  • the host system used (# PCIe lanes, other compute-related performance limitations)

The M.2 module uses a PCIe Gen 3, 2-lane interface.

We rarely encounter PCIe bandwidth as the source of performance limitations. However, it is possible if a user is running a model that achieves a very high FPS throughput or has a very high input/output resolution such that the input featuremap data transfer to the M.2 module exceeds the theoretical limit. Even so, it is more likely that host system compute resources pose a limitation on end-to-end performance before the PCIe bandwidth limitation is encountered.

Note that Raspberry Pi host systems have only one PCIe lane, cutting the typical PCIe bandwidth in half. This is a case where PCIe bandwidth is more likely to pose a limitation.
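
As a rough back-of-the-envelope check, you can compare the input data rate your model requires against the link's theoretical throughput (roughly 2 GB/s for a PCIe Gen 3 x2 interface). The numbers below are purely illustrative; substitute your own model's values.

# Rough estimate of the PCIe bandwidth needed for input featuremap transfer.
# Illustrative numbers only; substitute your own model's values.
width, height, channels = 1920, 1080, 3   # input resolution (RGB)
bytes_per_value = 1                       # e.g., 8-bit input data
fps = 100                                 # target throughput

required_bandwidth = width * height * channels * bytes_per_value * fps
print(f"Required input bandwidth: {required_bandwidth / 1e9:.2f} GB/s")
# ~0.62 GB/s here, well below the ~2 GB/s theoretical limit of a
# PCIe Gen 3 x2 link (and roughly half that limit on a 1-lane Raspberry Pi).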

Why does my locally compiled DFP benchmark with a lower FPS than the Model Explorer DFP?

All of the models on our Model Explorer page were compiled using a Closed Loop script with the --effort hard flag. When using the --effort hard flag with our Neural Compiler, multiple candidate mapping points will be generated, and the most efficient one is selected based on a cost function.

In addition to the performance improvements afforded by the --effort hard flag alone, the Closed Loop script incorporates benchmarking for each mapping point directly on the chip. This allows us to choose the mapping point which achieves the highest performance in terms of FPS.

Compiling models with the Closed Loop script will often result in better performance, but compilation time will increase significantly. For convenience, we make the optimal, pre-compiled DFPs available for download on the Model Explorer page.

If you still wish to compile the model on your own and would like to optimize performance, consider walking through the Closed Loop Compilation tutorial to get started.
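
If you prefer to experiment with a higher-effort compilation directly from Python before moving to the Closed Loop script, a minimal sketch might look like the following. The class and keyword names here are assumptions based on the CLI's --effort option and the model file name is hypothetical; check the Neural Compiler API reference for your SDK version.

# Minimal sketch: compile a model with a higher-effort mapping search.
# Assumes the Python Neural Compiler API mirrors the CLI's --effort option;
# verify class/argument names against the API reference for your SDK version.
from memryx import NeuralCompiler

nc = NeuralCompiler(models="my_model.onnx",  # hypothetical model file
                    effort="hard")           # search multiple mapping candidates
dfp = nc.run()                               # produces the compiled DFP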

The M.2 module only has up to 24 TOPS but still outperforms GPUs with 200+ TOPS. How is this possible?

TOPS (Tera-Operations per Second) is a commonly quoted metric, but it is often a misleading indicator of real performance. The best indication of performance is actually the efficiency with which TOPS are converted into FPS for a given model, and this conversion efficiency depends on hardware utilization.

Because of the unique dataflow architecture employed by MemryX, our MX3 chips have a typical hardware utilization of ~30-60%; comparatively, a standard GPU will have a typical hardware utilization rate of less than 10%.

The higher the hardware utilization rate for a particular system, the more efficient the conversion from quoted TOPS to end performance (FPS).
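
As a purely illustrative calculation (the utilization figures below are hypothetical examples within the ranges quoted above):

# Illustrative only: effective throughput is quoted TOPS x utilization.
mx3_effective = 24 * 0.50    # 24 TOPS at ~50% utilization  -> 12.0 effective TOPS
gpu_effective = 200 * 0.05   # 200 TOPS at ~5% utilization  -> 10.0 effective TOPS
print(mx3_effective, gpu_effective)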

Model Compilation#

What is model cropping, and why is it necessary?

Neural network models often have layers toward the beginning and end that perform pre-processing and post-processing operations. Because these are not traditional AI operations, they often do not benefit from running on an AI accelerator rather than the host CPU.

Models can also contain certain mathematical operations that are not natively supported on our chips, and as such, it is sometimes necessary to ‘crop’ these operations out of a model before mapping it to our hardware.

In a cropped model, pre-processing, post-processing, or unsupported portions of a model are cropped out and run on the host CPU, while the majority of the model runs on our accelerator. Cropping enables us to support a wider range of models while maintaining optimal performance.

Why are DFP files so big?

In some cases, a single .dfp file is actually made up of 2 separate data structures:

  1. Simulator Config

  2. MXA Config

The MXA Config is modest in size and used to program the chips. The Simulator Config, however, is much larger (often 4x) due to its hardware-accurate simulation data.

Use the dfp_inspect tool to see the size breakdown.

To reduce file size, the Simulator Config can be disabled using the --no_sim_dfp flag or no_sim_dfp=True in the Neural Compiler API.

If using our Neural Compiler on the command line, the Simulator Config will be disabled by default. If you wish to enable it via the Neural Compiler CLI, use the --sim_dfp flag.

My model is below the weight memory limitation of 42M parameters, but it won’t fit onto 4 chips when I compile. Why?

Each MX3 chip has a capacity of 10.5M parameters; for a 4-chip M.2 module, that’s a total capacity of 42M parameters. The number of parameters a model has is a good indicator for whether the model will be able to successfully map to our hardware, but it isn’t the only factor.

In addition to the number of parameters, whether a model fits within the available hardware resources depends on the following factors:

  • number of nodes (also called layers) in the model graph

  • overall size of input featuremap for the model

Our MX3 chips have three types of resources that can pose limitations at the compilation step. The first is compute cores, which can become a limitation for models that have too many nodes to map onto the available cores.

The other two resources are two separate types of memory: weights memory is responsible for storing the model parameters, and feature-map memory holds the model input, output, and intermediate data during inference. When a model fails to compile due to insufficient resources, it is often the case that one or both of these memories have exceeded capacity during mapping.

Even models with far fewer parameters than the weight memory limitation of a chip or module can fail to compile due to an excessive requirement of feature-map memory or compute core resources.
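
If you want a quick estimate of your model's parameter count before compiling, a rough sketch using the onnx package is shown below (the file name is hypothetical, and this only checks the weight-memory dimension; feature-map memory and compute-core usage are only known after compilation).

# Rough parameter count for an ONNX model, to compare against the
# 10.5M-per-chip / 42M-per-M.2-module weight capacity described above.
import numpy as np
import onnx

model = onnx.load("my_model.onnx")          # hypothetical file name
params = int(sum(np.prod(init.dims) for init in model.graph.initializer))
print(f"~{params / 1e6:.1f}M parameters")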

Windows#

Why are the benchmark tools unavailable on WSL?

MemryX utilizes WSL to support our offline Python-based tools:

  • Simulator

  • Neural Compiler

  • DFP Inspect

However, you can use our C++ benchmarking tool, acclBench, natively on Windows, along with the rest of the C++ API.

How can I run benchmarking on Windows?

You can refer to our tutorial, Hello Windows!, to get started with benchmarking on Windows systems. Please forgo the use of WSL when using the acclBench benchmarking tool.

Is Python runtime support available on Windows?

No, Python runtime support is not available on Windows or WSL. While WSL supports our offline Python tools, such as the Simulator and Neural Compiler, Python runtime is not supported because the WSL environment creates a barrier that prevents necessary communication with our drivers on the host system.

C++ runtime, however, is supported on native Windows.

In a future SDK release, native Python runtime support will be available on Windows.

General#

What is the difference between ‘multi-model’ and ‘multi-DFP’?

Multi-Model

In many applications, an inference pipeline may require more than one model. As an example, consider our tutorial on Face Detection and Emotion Recognition, where one model is used to detect a face in the input frame, and a second model takes the ROI containing the detected face as input and performs emotion recognition. In this tutorial, both of the models are co-mapped to our accelerator in a single DFP, which we refer to as a multi-model DFP.

Multi-DFP

With our latest SDK 2.0 release, we now enable the use of multiple DFPs in a single application. Imagine you want to use multiple models for an application of your choice, but the models are too large to compile to a single DFP for the number of MX3 chips installed on your system. Now, you can compile those models into two separate DFPs and swap which model is loaded onto the accelerator based on your application requirements.

Implications

With the semantics clarified, we can consider the possible combinations of multi-model and multi-DFP applications:

  • Single model, single DFP: The application requires only one model, and the model fits into a single DFP.

  • Single model, multi-DFP: The application requires only one model, but the model is too large to compile into one DFP. Thus, we can split the model into two DFPs and swap them for each input in an alternating fashion. (This feature will be available in a future release.)

  • Multi-model, single DFP: The application requires 2 or more models, and the models are able to fit together in a single DFP.

  • Multi-model, multi-DFP: The application requires 2 or more models, but the combined size of the models cannot compile into one DFP. We can compile the models separately into multiple DFPs and swap them as needed.
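
At compile time, the difference looks roughly like the sketch below. The list-of-models input, the output-file keyword, and the model file names are assumptions about the Python Neural Compiler API; verify the exact names against the API reference for your SDK version.

# Sketch only: co-mapping two models into one DFP vs. compiling them separately.
# The list-of-models input and the output-file keyword are assumptions about the
# Python Neural Compiler API; verify names against the API reference.
from memryx import NeuralCompiler

# Multi-model, single DFP: both models co-mapped onto the accelerator together.
NeuralCompiler(models=["face_detect.onnx", "emotion.onnx"],
               dfp_fname="combined.dfp").run()

# Multi-model, multi-DFP: compile separately and swap DFPs at runtime as needed.
NeuralCompiler(models="face_detect.onnx", dfp_fname="face_detect.dfp").run()
NeuralCompiler(models="emotion.onnx", dfp_fname="emotion.dfp").run()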

See also

Runtime Usage

Why are AVX instructions required for x86 systems?

While the MX3 chips themselves do not require AVX, our Python tools and API are built with dependencies on common neural network frameworks such as ONNX, TensorFlow, TFLite, etc., which strictly require the presence of AVX to operate on x86 systems.

Hint

On ARM systems, these dependencies are built with ARM NEON equivalents instead, so there is no AVX requirement on ARM (AVX does not exist on ARM anyway).

Though the C++ Runtime can be built from source without AVX, pre/post models cannot be run and general performance will be low. Therefore, as of SDK 2.0, we require AVX for the pre-packaged C++ Runtime on x86 as well.

What is the difference between pre/post models and pre/post-processing?

Neural networks will commonly have sections at the beginning or end of the model that perform pre-processing or post-processing operations on the input featuremap. When these sections of the model are cropped out by the Neural Compiler, we call a cropped portion from the beginning of a model the pre-model, or pre-processing model. Similarly, when the end portion of a model is cropped by the Neural Compiler, we refer to it as the post-model, or post-processing model.

It is easy to confuse these cropped pre- or post-processing sections of the model with actual pre-processing or post-processing steps, but they are distinct and separate in nature.

Even a cropped pre-model will require pre-processing steps at the application level before an input can be given to it; these commonly include conversion to floating-point format, resizing, and normalization. Similarly, the output of a post-model will typically require additional post-processing steps, such as drawing bounding boxes or filtering out detections below a certain confidence threshold.
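
For example, a typical set of application-level pre-processing steps might look like the following generic sketch using OpenCV and NumPy. The resolution, normalization constants, and channel order are model-specific examples, not values required by our SDK.

# Generic application-level pre-processing before feeding the (pre-)model.
# Resolution, normalization constants, and channel order are model-specific.
import cv2
import numpy as np

frame = cv2.imread("input.jpg")                  # BGR uint8 image
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # convert channel order
frame = cv2.resize(frame, (224, 224))            # resize to the model input size
frame = frame.astype(np.float32) / 255.0         # convert to float and scale

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # example constants
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
frame = (frame - mean) / std                     # normalize per channel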

I’m currently using CUDA, can I migrate to MemryX MX3?

The answer to this question depends on what you mean by ‘using CUDA’. Unless you are directly writing CUDA kernels in C++ and using the NVCC compiler, you are not technically using CUDA.

  • If you are running inference using PyTorch (or ONNX, TensorFlow, or TFLite) with a CUDA backend, you are not really using CUDA. Simply give your ONNX / TF / TFLite model to the Neural Compiler and it will try to compile it for MX3 (see the sketch after this list).

  • If you are truly using CUDA C++ as described above, we unfortunately do not support migration of that code to MX3. Please use standard neural net frameworks like PyTorch, ONNX, TF, or TFLite.
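
For the first case, the typical migration path is to export your trained model to ONNX and hand the exported file to the Neural Compiler. Below is a rough sketch of the export step; the model and input shape are hypothetical placeholders.

# Sketch: export a PyTorch model to ONNX, then compile the .onnx file
# with the MemryX Neural Compiler instead of running it on a CUDA backend.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()  # hypothetical model
dummy_input = torch.randn(1, 3, 224, 224)                 # hypothetical input shape

torch.onnx.export(model, dummy_input, "resnet50.onnx")
# "resnet50.onnx" can now be passed to the Neural Compiler for MX3.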

Logging and Telemetry#

How do I know my information is being sent anonymously?

The only customer-specific data collected is a randomized source ID, which can be changed at any time by the customer.

How do I change my ID or sharing preferences?

The ID and telemetry permission are stored in a JSON config file. On Linux, it’s located at: ~/.config/memx_config.json

Edit this file to update your ID or set telemetry to false:

{
  "anonymous_id": "998150d0fe6611eebcf6bd5b872ec7f5",
  "permit_telemetry": true
}
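
If you prefer to change these settings programmatically, a small sketch is shown below, assuming the file path and keys are exactly as shown above.

# Sketch: disable telemetry by editing the config file shown above.
import json
from pathlib import Path

config_path = Path.home() / ".config" / "memx_config.json"
config = json.loads(config_path.read_text())
config["permit_telemetry"] = False
config_path.write_text(json.dumps(config, indent=2))
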
Why do you collect model compilation data?

This data helps us debug and optimize model compilation without requiring screenshots or logs from users.

Do you collect model structure or weights?

No. Only a list of operators and basic model stats are collected to ensure compatibility and performance.

Are you collecting system information?

We collect basic system info (e.g., OS, CPU) to identify compatibility issues and aid debugging.

Can I see what is being sent?

Yes. The memryx.neural_compiler.log file generated by the Neural Compiler contains exactly the data that is sent. You can view it to see the information collected.

Is any of this information used for advertising or sent to 3rd parties?

No. All data remains within MemryX and is anonymous.