Release Notes#
v2.0#
The Level Up release – July 28, 2025
SDK 2.0 is a monumental release that introduces major new features and improvements across the entire software stack.
Compiler#
Neural Compiler Extensions (NCE)
New mechanism for extending Neural Compiler support: install .nce plugin files to add or patch graph handling between SDK releases.
Extensions can hot-fix graph issues or add decompositions for new ops.
Some extensions may merge into the core compiler in future releases; others can remain private (e.g., proprietary models).
Future SDK versions will include an online repository for official extensions and developer documentation.
Multi-Threaded Compilation
With --effort hard, optimization steps now parallelize across CPU threads, significantly reducing compile times on high-core systems. Use -j to specify the number of cores.
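A compile invocation using these options might look like the following sketch; the model file name is a placeholder and the -m model flag is assumed from typical mx_nc usage, while --effort and -j are as described above:

```shell
# Compile with maximum optimization effort, parallelized over 8 cores.
# "model.onnx" is a placeholder; -m is assumed to be the model argument.
mx_nc -m model.onnx --effort hard -j 8
```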
Model and Operator Support Enhancements
- Model support:
YOLO v10 & v11 (nano, small, medium)
ConvNeXT
ViT-small
… and more
- Operator support:
Fractional up-/down-sampling (e.g., 500 → 300 = 0.6×)
Modulus (%)
Enhanced TransposeConv
Improved handling of non-standard tensor shapes/layouts
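As an illustration of what fractional resampling means (this is plain linear interpolation, not the compiler's kernel), a length-500 input mapped to length 300 corresponds to a 0.6x scale factor:

```python
def fractional_resample(signal, out_len):
    """Linearly interpolate a 1-D sequence to a new length (illustration only)."""
    in_len = len(signal)
    scale = out_len / in_len          # e.g. 300 / 500 = 0.6
    out = []
    for i in range(out_len):
        # Map each output index back to a fractional input position.
        pos = i * (in_len - 1) / (out_len - 1) if out_len > 1 else 0
        lo = int(pos)
        hi = min(lo + 1, in_len - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return scale, out

scale, resampled = fractional_resample(list(range(500)), 300)
print(scale, len(resampled))  # 0.6 300
```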
DFP Operator Folding and Batch Support
Boundary operations (transpose/reshape) now execute within the DFP.
Original tensor shapes are stored in the DFP; runtime reshapes occur automatically.
Transparent batch-dimension handling for seamless deployment.
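A rough sketch of the idea in plain Python rather than the SDK: the original tensor shape is recorded alongside the flattened data, and the host rebuilds the layout transparently (names here are illustrative, not the DFP format):

```python
def fold_shape(tensor_shape, flat_data):
    # Record the model's original shape with the data, as a DFP might.
    return {"shape": tensor_shape, "data": flat_data}

def unfold(folded):
    # Host-side: rebuild the nested layout from the stored shape (2-D case).
    rows, cols = folded["shape"]
    d = folded["data"]
    return [d[r * cols:(r + 1) * cols] for r in range(rows)]

packed = fold_shape((2, 3), [1, 2, 3, 4, 5, 6])
print(unfold(packed))  # [[1, 2, 3], [4, 5, 6]]
```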
Under-the-Hood Improvements
Complete rewrite of shape-handling logic
Reimplementation of TFLite, Keras, and ONNX loaders
Numerous bug fixes and architectural cleanups
Runtime#
Multi-DFP Support
C++ and Python APIs now support loading multiple DFPs concurrently. (Co-mapping is still the preferred method.)
All-New MXA-Manager
Rebuilt with a custom networking/scheduling stack (replaces gRPC).
Shared Mode matches Local Mode in FPS and CPU usage.
Monitors temperature and power, sharing data with all clients.
Supports arbitrary combinations of clients, DFPs, and usage scenarios.
Provides user-configurable performance tuning knobs.
Socketfiles Replace IP Networking
MXA-Manager now uses UNIX domain sockets by default for inter-process communication.
Provides equivalent or better performance, with improved security and deployment simplicity.
TCP/IP remains supported and is faster than before.
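MXA-Manager's internals are not public, but the general mechanism it now uses, UNIX domain sockets for same-host IPC, can be sketched in a few lines of Python (socketpair stands in for the separate manager and client processes):

```python
import socket

# AF_UNIX sockets exchange bytes between local processes without TCP/IP.
# socketpair() gives two connected endpoints, standing in for client/server.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

client.sendall(b"infer-request")
request = server.recv(1024)        # manager side receives the request
server.sendall(b"infer-result")
reply = client.recv(1024)          # client side receives the reply

print(request, reply)  # b'infer-request' b'infer-result'
client.close()
server.close()
```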
DFP Shape Folding
Folded reshape operations now run transparently on the host—no need for auto-crop.
Python Shared Mode Support
Python Runtime now supports Shared Mode and Multi-DFP functionality, achieving feature parity with the C++ API.
Python Multi-Device Load Balancing
Python API can distribute inference workloads across multiple connected MX3 M.2 modules, just like C++.
Faster C++ Runtime
Input callbacks perform fewer memcpy operations and use a new matrix transposition method, improving FPS and reducing CPU load, especially on ARM SoCs.
MXA-Manager for Windows
Now included in the Windows SDK, enabling Shared Mode and Multi-DFP support on Windows.
Driver#
Faster DFP Downloads
Linux and Windows drivers switch DFPs up to 4× faster than SDK 1.2.
Faster downloads and caching of pre-parsed DFPs.
Improved Host System Compatibility
- MX3 firmware detects and optimizes for select x86 platforms.
Tuned MSI-X capability
Faster driver boot times
Standard Windows Installer
A standard Windows installer has been released for easier setup.
Fixes
Multiple bug fixes and stability improvements
Utils#
Improved GUI Toolkit
Significant bug fixes and performance optimizations for the Qt-based C++ toolkit.
Developer Hub#
Expanded Model Explorer Sources
The open-source timm models are included in Model Explorer.
Tutorials
Updated tutorials now point to the MemryX Examples page to share a unified, updated codebase.
Added new Windows “Getting Started” tutorials.
v1.2#
The Snowday release – Mar. 5, 2025
SDK 1.2 is a light, yet very important release that updates the MX3 firmware and driver, and improves runtime stability.
Runtime Enhancements#
Driver & Firmware Updates
Updated drivers and firmware to enhance performance and overall stability.
New firmware improves MX3 M.2 compatibility with host PCIe controllers.
MXA-Manager Improvements
Renamed mx-server to mxa-manager to more accurately convey its purpose.
Implemented several fixes and refinements.
Windows Updates
Windows and Linux versions are aligned again (both at 1.2).
All fixes and improvements to the driver, firmware, and runtime are included.
NOTE: mxa-manager remains Linux-only for now.
v1.1#
The Holiday Special release – Dec. 23, 2024
SDK 1.1 focuses on the Linux Driver and C++ Runtime, with performance improvements and new options for certain use cases. The NeuralCompiler has several improvements as well.
Runtime#
Driver Improvements
Enhanced performance with the latest driver, with some models achieving up to 30% improvement in FPS.
Firmware Updater
The Linux driver package memx-drivers will auto-update the firmware (if required) on connected devices.
The firmware update tool is also available to use directly.
MXA-Manager (Preview)
Multiple concurrent processes can now share the same MX3 M.2 using the new mxa_manager Linux service!
This service also allows the MX3 to be used in parallel from different Docker containers, and even shared over an IP network.
Please see the first MXA-Manager tutorial for an introduction.
Important
MXA-Manager is released in a preview state for SDK 1.1 and will undergo improvements in future SDKs. The user-facing API for the C++ Runtime is considered stable, but under the hood there are known performance issues that we are working to address.
When possible, users should continue to use the C++ Runtime as before: one or more video streams (using threads) in a single process.
No-Copy C++ Runtime Option
For users who want more fine-grained memory management in our C++ Runtime, “no-copy” versions of the get_data functions have been added.
These versions do not copy data, but instead operate directly on the pointers supplied to them.
This feature is especially helpful on low-end ARM Linux systems, where copying memory can add noticeable latency.
This is an advanced feature that requires application developers to be aware of the implications.
Please see the No-Copy Accl API tutorial for information on this feature.
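The no-copy get_data variants are a C++ feature, but the underlying idea, writing through a caller-supplied buffer instead of returning a fresh copy, can be illustrated with Python's memoryview (function names below are illustrative, not the Accl API):

```python
def fill_output_copy(data):
    # Copying variant: allocates and returns a new buffer.
    return bytes(data)

def fill_output_nocopy(dest, data):
    # No-copy variant: writes directly into the caller's buffer.
    view = memoryview(dest)
    view[:len(data)] = data          # no intermediate allocation

buf = bytearray(4)
fill_output_nocopy(buf, b"\x01\x02\x03\x04")
print(buf)  # bytearray(b'\x01\x02\x03\x04')
```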
Compiler#
Expanded Support and Enhanced Performance and Stability
Check out the operator support page for a list of all supported operators.
Multiple fixes and optimizations have been made to the compiler.
Naming Convention
Cropped pre/post models now follow updated naming conventions, where the model number is appended to the model name only if more than one model is compiled concurrently.
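The rule can be sketched as a small function; the file-name pattern below is illustrative, not the compiler's exact output:

```python
def cropped_model_name(base, index, total_models):
    """Append the model number only when more than one model is compiled."""
    suffix = f"_{index}" if total_models > 1 else ""
    return f"{base}{suffix}_pre.onnx"   # hypothetical pre-model file name

print(cropped_model_name("resnet", 0, 1))  # resnet_pre.onnx
print(cropped_model_name("resnet", 0, 2))  # resnet_0_pre.onnx
```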
Developer Hub#
Model eXplorer
The Model eXplorer has been updated with new models and DFPs.
New Tutorials
The tutorials section has been updated with new tutorials.
The tutorials cover a wide range of topics, from basic model compilation to advanced features like the MXA-Manager.
Subreleases#
v1.1.1 → v1.1.5
Minor fixes and improvements.
Note
SDK 1.1 features are Linux-only for now, while Windows features remain at SDK 1.0. An upcoming SDK release will bring Windows back up to feature parity with Linux.
v1.0#
The Hello World release – Oct. 1, 2024
MemryX SDK 1.0 is our first publicly available SDK, and it includes many important new features over the previous 0.10 version.
General#
Python 3.11 and 3.12
The Python package now supports Python 3.9 through 3.12!
Important
We have dropped support for Python 3.8. Please see here for help if you cannot install 3.9 or above.
Compiler Support on ARM
The NeuralCompiler is now fully supported on ARM devices.
No need to compile models on x86 and copy them over!
ARM Device Setup Helper
The mx_arm_setup command is now part of the driver package on ARM.
Run this command once after install to set up device tree overlays and/or other board-specific tweaks automatically.
Currently this script supports:
Raspberry Pi 5 (Raspberry Pi OS recommended)
Orange Pi 5 Plus
Orange Pi 5 Max
Radxa Rock 5B
We will continue to add more board support in the future.
Other boards may already work out-of-the-box. If you are having issues with your ARM platform, please reach out to MemryX for assistance.
DFP v6
The DFP file format has been revised from v5 to v6.
Although this is mostly an internal feature, it allows us to add new features to the SDK more easily in the future.
All libraries and the driver still support v5 files. But the NeuralCompiler will only output v6 from now on.
Runtime#
Multi-device Load Balancing
By simply passing a list of device IDs to use instead of a single number, the C++ runtime will automatically run your DFP on multiple M.2 cards and load balance your streams between them!
In other words: 2x M.2 == 2x FPS by changing a single line of code.
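The load-balancing idea, though not the SDK's actual scheduler, can be sketched as a round-robin assignment of streams to device IDs:

```python
from itertools import cycle

def assign_streams(stream_ids, device_ids):
    # Round-robin each incoming stream onto the next device in the list.
    devices = cycle(device_ids)
    return {stream: next(devices) for stream in stream_ids}

# Two M.2 cards, four video streams: each card serves two streams.
print(assign_streams(["cam0", "cam1", "cam2", "cam3"], [0, 1]))
# {'cam0': 0, 'cam1': 1, 'cam2': 0, 'cam3': 1}
```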
Automatic Pre/Post Runners (C++)
The C++ Accl API now supports automatic execution of cropped pre/post ONNX/TF models.
You no longer have to manage your own inference sessions!
The AsyncAccl Python API already had similar functionality for Python applications.
Please see the CenterNet C++ tutorial for a good example of its usage.
Note
These pre/post functions are provided by the memx-accl-plugins package.
C++ GUI Helper Library
The Qt-based GUI toolkit library has been added to the SDK as a convenient alternative to OpenCV imshow or custom-written Qt.
C++ Manual Threading Mode Changes
The manual-threading mode of MxAccl has been moved to a separate class called MxAcclMT.
The receive_output function can now block for an output from a specific stream_id, instead of needing the user to sort frames to streams.
Updated Driver
The driver now supports the “low-BAR M.2” being sampled as part of a 4x M.2 kit to certain customers.
Continued performance improvements in the driver, particularly for high FPS models. Max model FPS can now reach ~40,000 instead of ~15,000.
Advanced users can now increase or decrease the M.2’s power & performance from the default, using the mx_set_powermode script.
Compiler#
New Framework Support
Added support for Keras 3 while maintaining backward compatibility with Keras 2.
Upgraded TensorFlow support to version 2.17.
Enhanced Performance and Support
Expanded operator support, including new and optimized operators.
Improved accuracy for Softmax and exponential operators.
The --effort hard compilation option has been further optimized, offering speedups across a wide range of models.
Some models have experienced substantial performance gains, running several times faster.
Faster Compilation
Compilation times have been significantly reduced, with speedups reaching up to 3x for certain models.
Stable API
The compiler APIs have reached a stable state (v1.0.0).
User Interface
Enhanced CLI visualization for a more intuitive user experience.
More detailed error messages providing step-by-step guidance to resolve issues.
Alpha/Beta Releases#
v0.10
Driver Performance Boosts
New output feature map optimization gives better PCIe bandwidth utilization. This boosts FPS for many-output models such as SSDs and newer YOLOs.
Solved bug that caused FPS to plateau around 2500 FPS. Now small models can easily exceed 15,000 FPS on the MX3.
Improved MX3 Firmware
The MX3 firmware now has a 600MHz base frequency for the M.2 (+20%), following extensive testing and characterization by our platform teams.
The MX3 thermal throttling temperature has been increased from 85C (Tj) to 100C, again following extensive characterization by our teams.
In the already-rare situations where the MX3 hangs, it will now self-restart instead of requiring a host reboot.
Compiler Effort Control
The NeuralCompiler now has the --effort flag, which allows users to boost FPS substantially by allowing the Compiler more time to try different optimizations. Using --effort hard, users can get much higher FPS vs. the default (--effort normal). Results will vary by model: we have observed from 1.0x to >10.0x.
The use of --effort hard is now strongly recommended before deploying your model into a final product.
But please note that use of hard will greatly increase the amount of time needed to compile your model: around 5x to 15x longer. In the next SDK release, this flag will be optimized to take only a tiny bit longer than normal.
Other Compiler Features
Expanded operator support and FPS boosts for existing models (even on normal).
Improved mx_nc startup time.
More optimized auto-selection of data packing format.
C++ API Performance Improvements
Single-stream scenarios can experience up to a ~30% boost on some host CPUs.
Manual-threading mode has significant performance boosts for multi-stream scenarios.
acclBench Tool
C++ version of the mx_bench Python tool, for better cross-platform support.
The acclBench executable is included with the C++ API package.
See the benchmark page for more info.
Expanded Model Explorer
The Model Explorer has been expanded with new models and downloadable DFPs.
You can now track this page to monitor FPS improvements between SDK updates!
0.10.2
Accl: C++ fix for manual threading mode and ONNX models
Accl: added a memx-accl-noavx option for x86_64 systems that lack AVX2
0.10.1
Compiler: 16-bit weight bugfixes
Accl/Benchmark: Python bugfix for some ARM systems
v0.9
Neural Compiler Additions & Improvements
Logging: Introduced a comprehensive logging system for improved debugging and diagnostics for the Neural Compiler. Optionally, users can submit these logs to MemryX to help improve the Compiler.
Expanded Support: The Compiler has added additional operator support across various frameworks. For a detailed list please visit the operator support page.
Improved User Messaging: Updated compiler messages now feature actionable TODOs to help users address issues more effectively.
Performance Enhancements: Major improvements and bugfixes have significantly increased performance for some models.
C++ API Thread Management
The C++ version of the Multi-Stream Accl API now has support for limiting the number of Stream worker threads.
This feature can improve performance and/or reduce CPU overhead in many-stream applications such as VMS.
Driver Improvements
Input feature map bandwidth has been greatly optimized. Please note there is still a known issue with output feature map bandwidth that will be addressed in a future update.
Temperature data is now reported over the standard Linux hwmon interface.
Model Explorer
Discover and utilize models effortlessly with our new Model Explorer.
The MemryX SDK avoids manual tuning or retraining and is committed to running models out-of-the-box.
All models listed on the Model Explorer are directly from their original sources; no modifications have been made and no model-specific tunings have been done.
v0.1 to v0.8
Removed so we don’t clutter up this page :-)