Release Notes#

v2.0#

The Level Up release – July 28, 2025

SDK 2.0 is a monumental release that introduces major new features and improvements across the entire software stack.

Compiler#

Neural Compiler Extensions (NCE)
  • New mechanism for extending Neural Compiler support: install .nce plugin files to add or patch graph handling between SDK releases.

  • Extensions can hot-fix graph issues or add decompositions for new ops.

  • Some extensions may merge into the core compiler in future releases; others can remain private (e.g., proprietary models).

  • Future SDK versions will include an online repository for official extensions and developer documentation.

Multi-Threaded Compilation

With --effort hard, optimization steps now run in parallel across CPU threads, significantly reducing compile times on high-core-count systems. Use -j to specify the number of cores.
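
The idea can be pictured with a toy sketch (the scoring function is hypothetical and this is not the compiler's internal logic): independent optimization attempts are fanned out across worker threads, with the worker count playing the role of -j.

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_candidate(seed):
    # Stand-in for one optimization attempt; a real compiler would evaluate
    # a candidate schedule or mapping here and return its score.
    return (seed * 37) % 101  # deterministic toy "score"

def best_of(candidates, workers=4):
    # Fan independent attempts out across worker threads and keep the best;
    # `workers` plays the role of the -j core count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(optimize_candidate, candidates))
    return max(zip(scores, candidates))

best_score, best_seed = best_of(range(16), workers=4)
```

Because the attempts are independent, the result is identical regardless of the worker count; only the wall-clock time changes.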

Model and Operator Support Enhancements
Model support:
  • YOLO v10 & v11 (nano, small, medium)

  • ConvNeXT

  • ViT-small

  • … and more

Operator support:
  • Fractional up-/down-sampling (e.g., 500 → 300, a 0.6× scale factor)

  • Modulus (%)

  • Enhanced TransposeConv

  • Improved handling of non-standard tensor shapes/layouts
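
To illustrate what a fractional resampling ratio means (the nearest-neighbor policy and function name here are illustrative, not the compiler's actual algorithm): resampling 500 input elements down to 300 outputs corresponds to a 0.6× scale factor.

```python
def fractional_resample(values, out_len):
    # Nearest-neighbor 1-D resample: maps an input of any length to out_len
    # samples, so 500 -> 300 corresponds to a 0.6x scale factor.
    in_len = len(values)
    scale = in_len / out_len
    return [values[min(in_len - 1, int(i * scale))] for i in range(out_len)]

dst = fractional_resample(list(range(500)), 300)  # len(dst) == 300
```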

DFP Operator Folding and Batch Support
  • Boundary operations (transpose/reshape) now execute within the DFP.

  • Original tensor shapes are stored in the DFP; runtime reshapes occur automatically.

  • Transparent batch-dimension handling for seamless deployment.
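
A minimal sketch of the shape-restoration idea (pure Python, with nested lists standing in for tensors; this is a conceptual analogy, not the runtime's implementation): the original shape stored in the DFP lets the runtime rebuild the full tensor from flat device output.

```python
def restore_shape(flat, shape):
    # Rebuild a nested tensor from flat data using the stored original shape,
    # analogous in spirit to the runtime reshaping DFP outputs automatically.
    if len(shape) == 1:
        return list(flat[:shape[0]])
    step = len(flat) // shape[0]
    return [restore_shape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

tensor = restore_shape(list(range(6)), (2, 3))  # [[0, 1, 2], [3, 4, 5]]
```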

Under-the-Hood Improvements
  • Complete rewrite of shape-handling logic

  • Reimplementation of TFLite, Keras, and ONNX loaders

  • Numerous bug fixes and architectural cleanups

Runtime#

Multi-DFP Support

C++ and Python APIs now support loading multiple DFPs concurrently. (Co-mapping is still the preferred method.)

All-New MXA-Manager
  • Rebuilt with a custom networking/scheduling stack (replaces gRPC).

  • Shared Mode matches Local Mode in FPS and CPU usage.

  • Monitors temperature and power, sharing data with all clients.

  • Supports arbitrary combinations of clients, DFPs, and usage scenarios.

  • Provides user-configurable performance tuning knobs.

Socketfiles Replace IP Networking
  • MXA-Manager now uses UNIX domain sockets by default for inter-process communication.

  • Provides equivalent or better performance, with improved security and deployment simplicity.

  • TCP/IP remains supported and is faster than before.
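
For readers unfamiliar with socket files, the snippet below is a generic UNIX-domain-socket echo demo (the path and protocol are made up for illustration and are unrelated to MXA-Manager's actual wire format): a filesystem path replaces the IP:port endpoint.

```python
import os
import socket
import tempfile
import threading

# A filesystem path, not an IP:port pair, is the endpoint (hypothetical path).
sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")
ready = threading.Event()

def serve():
    # Minimal echo server bound to a socket file.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(sock_path)
        srv.listen(1)
        ready.set()                      # signal that clients may connect
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(64).upper())

server = threading.Thread(target=serve)
server.start()
ready.wait()

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
    cli.connect(sock_path)
    cli.sendall(b"ping")
    reply = cli.recv(64)                 # b"PING"
server.join()
os.unlink(sock_path)
```

File permissions on the socket path provide access control for free, which is part of why socket files simplify secure deployment compared with an open TCP port.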

DFP Shape Folding
  • Folded reshape operations now run transparently on the host—no need for auto-crop.

Python Shared Mode Support
  • Python Runtime now supports Shared Mode and Multi-DFP functionality, achieving feature parity with the C++ API.

Python Multi-Device Load Balancing
  • Python API can distribute inference workloads across multiple connected MX3 M.2 modules, just like C++.
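
One way to picture the distribution (a hypothetical round-robin policy; the runtime's actual scheduling may differ): streams are assigned to the listed device IDs in turn.

```python
from itertools import cycle

def assign_streams(stream_ids, device_ids):
    # Round-robin each stream onto the next device in the list, one plausible
    # policy behind "pass a list of device IDs" (illustrative only).
    devices = cycle(device_ids)
    return {s: next(devices) for s in stream_ids}

plan = assign_streams(["cam0", "cam1", "cam2", "cam3"], [0, 1])
# {'cam0': 0, 'cam1': 1, 'cam2': 0, 'cam3': 1}
```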

Faster C++ Runtime
  • Input callbacks perform fewer memcpy operations and use a new matrix transposition method, improving FPS and reducing CPU load, especially on ARM SoCs.

MXA-Manager for Windows
  • Now included in the Windows SDK, enabling Shared Mode and Multi-DFP support on Windows.

Driver#

Faster DFP Downloads
  • Linux and Windows drivers switch DFPs up to 4× faster than SDK 1.2.

  • Faster downloads and caching of pre-parsed DFPs.

Improved Host System Compatibility
  • MX3 firmware detects and optimizes for select x86 platforms.
    • Tuned MSI-X capability

    • Faster driver boot times

Standard Windows Installer
  • A standard Windows installer has been released for easier setup.

Fixes
  • Multiple bug fixes and stability improvements

Utils#

Improved GUI Toolkit
  • Significant bug fixes and performance optimizations for the Qt-based C++ toolkit.

Developer Hub#

Expanded Model Explorer Sources
  • The open-source timm models are included in Model Explorer.

Tutorials
  • Updated tutorials now point to the MemryX Examples page to share a unified, updated codebase.

  • Added new Windows “Getting Started” tutorials.

v1.2#

The Snowday release – Mar. 5, 2025

SDK 1.2 is a light, yet very important release that updates the MX3 firmware and driver, and improves runtime stability.

Runtime Enhancements#

Driver & Firmware Updates
  • Updated drivers and firmware to enhance performance and overall stability.

  • New firmware improves MX3 M.2 compatibility with host PCIe controllers.

MXA-Manager Improvements
  • Renamed mx-server to mxa-manager to more accurately convey its purpose.

  • Implemented several fixes and refinements.

Windows Updates
  • Windows and Linux versions are aligned again (both at 1.2).

  • All driver, firmware, and runtime fixes and improvements are included.

  • NOTE: mxa-manager remains Linux-only for now.

v1.1#

The Holiday Special release – Dec. 23, 2024

SDK 1.1 focuses on the Linux Driver and C++ Runtime, with performance improvements and new options for certain use cases. The NeuralCompiler has several improvements as well.

Runtime#

Driver Improvements
  • Enhanced performance with the latest driver, with some models achieving up to 30% improvement in FPS.

Firmware Updater
  • The Linux driver package memx-drivers will auto-update the firmware (if required) on connected devices.

  • The firmware update tool is also available to use directly.

MXA-Manager (Preview)
  • Multiple concurrent processes can now share the same MX3 M.2 using the new mxa_manager Linux service!

  • This service also allows for use of the MX3 in parallel from different Docker containers, and even sharing over an IP network.

  • Please see the first MXA-Manager tutorial for an introduction.

Important

MXA-Manager is released in a preview state for SDK 1.1 and will undergo improvements in future SDKs. The user-facing API for the C++ Runtime is considered stable, but under the hood there are known performance issues that we are working to address.

When possible, users should continue to use the C++ Runtime as before: one or more video streams (using threads) in a single process.

No-Copy C++ Runtime Option
  • For users who want more fine-grained memory management in our C++ Runtime, “no-copy” versions of the get_data functions have been added.

  • These versions do not copy data, but instead operate directly on the pointers supplied to them.

  • This feature is especially helpful on low-end ARM Linux systems, where copying memory can add noticeable latency.

  • This is an advanced feature that requires application developers to be aware of the implications.

  • Please see the No-Copy Accl API tutorial for information on this feature.
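
The copy vs. no-copy distinction can be illustrated in plain Python with memoryview (an analogy only, not the Accl API itself): a no-copy accessor hands back a view into the caller's buffer, so writes through it are visible immediately and the caller owns the buffer's lifetime.

```python
buf = bytearray(8)                  # caller-owned buffer, all zeros

def get_data_copy(src):
    # Copying accessor: returns an independent snapshot of the data.
    return bytes(src)

def get_data_nocopy(src):
    # No-copy accessor: returns a zero-copy view over the caller's buffer.
    # The caller must keep `src` alive while the view is in use.
    return memoryview(src)

snapshot = get_data_copy(buf)
view = get_data_nocopy(buf)
view[0] = 0xFF                      # writes through to buf; no allocation
# buf[0] is now 0xFF, while snapshot still holds the original zero byte
```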

Compiler#

Expanded Support and Enhanced Performance and Stability
  • Check out the operator support page for a list of all supported operators.

  • Multiple fixes and optimizations have been made to the compiler.

Naming Convention
  • Cropped pre/post models now follow an updated naming convention: the model number is appended to the model name only when more than one model is compiled concurrently.
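
In other words (a hypothetical helper; the function name and exact suffix format are illustrative, not the compiler's):

```python
def cropped_model_names(base, num_models):
    # Append the model number only when more than one model is compiled
    # concurrently, per the updated convention (suffix format assumed).
    if num_models == 1:
        return [base]
    return [f"{base}_{i}" for i in range(num_models)]

cropped_model_names("resnet", 1)   # ['resnet']
cropped_model_names("resnet", 2)   # ['resnet_0', 'resnet_1']
```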

Developer Hub#

Model eXplorer
New Tutorials
  • The tutorials section has been updated with new tutorials.

  • The tutorials cover a wide range of topics, from basic model compilation to advanced features like the MXA-Manager.

Subreleases#

v1.1.1 → v1.1.5
  • Minor fixes and improvements.

Note

SDK 1.1 features are Linux-only for now, while Windows features remain at SDK 1.0. An upcoming SDK release will bring Windows back up to feature parity with Linux.

v1.0#

The Hello World release – Oct. 1, 2024

MemryX SDK 1.0 is our first publicly available SDK, and it includes many important new features over the previous 0.10 version.

General#

Python 3.11 and 3.12

Important

We have dropped support for Python 3.8. Please see here for help if you cannot install 3.9 or above.

Compiler Support on ARM
  • The NeuralCompiler is now fully supported on ARM devices.

  • No need to compile models on x86 and copy them over!

ARM Device Setup Helper
  • The mx_arm_setup command is now part of the driver package on ARM.

  • Run this command once after install to set up device tree overlays and/or other board-specific tweaks automatically.

  • Currently this script supports:

    • Raspberry Pi 5 (Raspberry Pi OS recommended)

    • Orange Pi 5 Plus

    • Orange Pi 5 Max

    • Radxa Rock 5B

  • We will continue to add more board support in the future.

  • Other boards may already work out-of-the-box. If you are having issues with your ARM platform, please reach out to MemryX for assistance.

DFP v6
  • The DFP file format has been revised from v5 to v6.

  • Although this is mostly an internal feature, it allows us to add new features to the SDK more easily in the future.

  • All libraries and the driver still support v5 files, but the NeuralCompiler will only output v6 from now on.

Runtime#

Multi-device Load Balancing
  • By simply passing a list of device IDs to use instead of a single number, the C++ runtime will automatically run your DFP on multiple M.2 cards and load balance your streams between them!

  • In other words: 2x M.2 == 2x FPS by changing a single line of code.

Automatic Pre/Post Runners (C++)
  • The C++ Accl API now supports automatic execution of cropped pre/post ONNX/TF models.

  • You no longer have to manage your own inference sessions!

  • The AsyncAccl Python API already had similar functionality for Python applications.

  • Please see the CenterNet C++ tutorial for a good example of its usage.

Note

These pre/post functions are provided by the memx-accl-plugins package.

C++ GUI Helper Library
  • The Qt-based GUI toolkit library has been added to the SDK as a convenient alternative to OpenCV imshow or custom-written Qt.

C++ Manual Threading Mode Changes
  • The manual-threading mode of MxAccl has been moved to a separate class called MxAcclMT.

  • The receive_output function can now block for an output from a specific stream_id, instead of needing the user to sort frames to streams.
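
The new behavior can be sketched with per-stream queues (a toy Python model, not the MxAcclMT implementation): each stream_id gets its own queue, so a caller can block on exactly the stream it cares about instead of sorting frames itself.

```python
import queue
import threading

class StreamDemux:
    # Route completed frames into per-stream queues so a caller can block
    # on a specific stream_id (illustrative class, names are hypothetical).
    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _q(self, stream_id):
        with self._lock:
            return self._queues.setdefault(stream_id, queue.Queue())

    def push(self, stream_id, frame):
        self._q(stream_id).put(frame)

    def receive_output(self, stream_id, timeout=None):
        # Blocks until a frame for this specific stream_id is available.
        return self._q(stream_id).get(timeout=timeout)

d = StreamDemux()
d.push(1, "frame-A")
d.push(0, "frame-B")
d.receive_output(0)   # returns "frame-B", even though it arrived second
```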

Updated Driver
  • The driver now supports the “low-BAR M.2” being sampled as part of a 4x M.2 kit to certain customers.

  • Continued performance improvements in the driver, particularly for high FPS models. Max model FPS can now reach ~40,000 instead of ~15,000.

  • Advanced users can now increase or decrease the M.2’s power & performance from the default, using the mx_set_powermode script.

Compiler#

New Framework Support
  • Added support for Keras 3 while maintaining backward compatibility with Keras 2.

  • Upgraded TensorFlow support to version 2.17.

Enhanced Performance and Support
  • Expanded operator support, including new and optimized operators.

  • Improved accuracy for Softmax and exponential operators.

  • The --effort hard compilation option has been further optimized, offering speedups across a wide range of models.

  • Some models have seen substantial performance gains, running severalfold faster.

Faster Compilation
  • Compilation times have been significantly reduced, with speedups reaching up to 3x for certain models.

Stable API
  • The compiler APIs have reached a stable state (v1.0.0).

User Interface
  • Enhanced CLI visualization for a more intuitive user experience.

  • More detailed error messages providing step-by-step guidance to resolve issues.


Alpha/Beta Releases#

v0.10
Driver Performance Boosts
  • New output feature map optimization gives better PCIe bandwidth utilization. This boosts FPS for many-output models such as SSDs and newer YOLOs.

  • Solved bug that caused FPS to plateau around 2500 FPS. Now small models can easily exceed 15,000 FPS on the MX3.

Improved MX3 Firmware
  • The MX3 firmware now has a 600MHz base frequency for the M.2 (+20%), following extensive testing and characterization by our platform teams.

  • The MX3 thermal throttling temperature has been increased from 85C (Tj) to 100C, again following extensive characterization by our teams.

  • In the already-rare situations where the MX3 hangs, it will now self-restart instead of requiring a host reboot.

Compiler Effort Control
  • The NeuralCompiler now has the --effort flag, which allows users to boost FPS substantially by allowing the Compiler more time to try different optimizations. Using --effort hard, users can get much higher FPS vs. the default (--effort normal). Results will vary by model: we have observed from 1.0x to >10.0x.

  • The use of --effort hard is now strongly recommended before deploying your model into a final product.

  • But please note that use of hard will greatly increase the amount of time needed to compile your model: around 5x to 15x longer. In the next SDK release, this flag will be optimized to take only a tiny bit longer than normal.

Other Compiler Features
C++ API Performance Improvements
  • Single-stream scenarios can experience up to a ~30% boost on some host CPUs.

  • Manual-threading mode has significant performance boosts for multi-stream scenarios.

acclBench Tool
  • C++ version of the mx_bench Python tool, for better cross-platform support.

  • acclBench executable included with the C++ API package.

  • See the benchmark page for more info.

Expanded Model Explorer
  • The Model Explorer has been expanded with new models and downloadable DFPs.

  • You can now track this page to monitor FPS improvements between SDK updates!

0.10.2
  • Accl: C++ fix for manual threading mode and ONNX models

  • Accl: added a memx-accl-noavx option for x86_64 systems that lack AVX2

0.10.1
  • Compiler: 16-bit weight bugfixes

  • Accl/Benchmark: Python bugfix for some ARM systems

v0.9
Neural Compiler Additions & Improvements
  • Logging: Introduced a comprehensive logging system for improved debugging and diagnostics for the Neural Compiler. Optionally, users can submit these logs to MemryX to help improve the Compiler.

  • Expanded Support: The Compiler has added additional operator support across various frameworks. For a detailed list please visit the operator support page.

  • Improved User Messaging: Updated compiler messages now feature actionable TODOs to help users address issues more effectively.

  • Performance Enhancements: Major improvements and bugfixes have significantly increased performance for some models.

C++ API Thread Management
  • The C++ version of the Multi-Stream Accl API now has support for limiting the number of Stream worker threads.

  • This feature can improve performance and/or reduce CPU overhead in many-stream applications such as VMS.

Driver Improvements
  • Input feature map bandwidth has been greatly optimized. Please note there is still a known issue with output feature map bandwidth that will be addressed in a future update.

  • Temperature data is now reported over the standard Linux hwmon interface.

Model Explorer
  • Discover and utilize models effortlessly with our new Model Explorer.

  • The MemryX SDK is committed to running models out-of-the-box, without manual tuning or retraining.

  • All models listed on the Model Explorer are directly from their original sources; no modifications have been made and no model-specific tunings have been done.

v0.1 to v0.8

Removed so we don’t clutter up this page :-)