Release Notes#
v1.0#
General#
Python 3.11 and 3.12
The Python package now supports Python 3.9 through 3.12!
Important
We have dropped support for Python 3.8. Please see here for help if you cannot install 3.9 or above.
Compiler Support on ARM
The NeuralCompiler is now fully supported on ARM devices.
No need to compile models on x86 and copy them over!
ARM Device Setup Helper
The mx_arm_setup command is now part of the driver package on ARM. Run this command once after install to set up device tree overlays and/or other board-specific tweaks automatically.
Currently this script supports:
Raspberry Pi 5 (Raspberry Pi OS recommended)
Orange Pi 5 Plus
Orange Pi 5 Max
Radxa Rock 5B
We will continue to add more board support in the future.
Other boards may already work out-of-the-box. If you are having issues with your ARM platform, please reach out to MemryX for assistance.
DFP v6
The DFP file format has been revised from v5 to v6.
Although this is mostly an internal feature, it allows us to add new features to the SDK more easily in the future.
All libraries and the driver still support v5 files, but the NeuralCompiler will only output v6 from now on.
Runtime#
Multi-device Load Balancing
By simply passing a list of device IDs to use instead of a single number, the C++ runtime will automatically run your DFP on multiple M.2 cards and load balance your streams between them!
In other words: 2x M.2 == 2x FPS by changing a single line of code.
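The real call is in the C++ Accl API; as a rough conceptual sketch only (the class and method names below are illustrative, not the actual MemryX API), the runtime's behavior amounts to round-robin assignment of streams to the listed devices:

```python
from itertools import cycle

class MultiDeviceBalancer:
    """Illustrative round-robin balancer: each new stream is pinned to
    the next device in the list, as the runtime does internally when
    given a list of device IDs instead of a single number."""
    def __init__(self, device_ids):
        self._devices = cycle(device_ids)
        self.assignments = {}

    def assign(self, stream_id):
        # A stream keeps its device once assigned; new streams rotate.
        if stream_id not in self.assignments:
            self.assignments[stream_id] = next(self._devices)
        return self.assignments[stream_id]

balancer = MultiDeviceBalancer([0, 1])  # two M.2 cards
print([balancer.assign(s) for s in ("cam0", "cam1", "cam2", "cam3")])
# -> [0, 1, 0, 1]
```

With two devices and four streams, each card services two streams, which is where the 2x throughput comes from.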
Automatic Pre/Post Runners (C++)
The C++ Accl API now supports automatic execution of cropped pre/post ONNX/TF models.
You no longer have to manage your own inference sessions!
The AsyncAccl Python API already had similar functionality for Python applications.
Please see the CenterNet C++ tutorial for a good example of its usage.
Note
These pre/post functions are provided by the memx-accl-plugins package.
C++ GUI Helper Library
The Qt-based GUI toolkit library has been added to the SDK as a convenient alternative to OpenCV imshow or custom-written Qt.
C++ Manual Threading Mode Changes
The manual-threading mode of MxAccl has been moved to a separate class called MxAcclMT.
The receive_output function can now block for an output from a specific stream_id, instead of needing the user to sort frames to streams.
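Conceptually, blocking on a specific stream_id means the runtime demultiplexes finished frames into per-stream queues. A minimal Python sketch of that idea (names here are illustrative; the real API is the C++ MxAcclMT class):

```python
import queue
import threading

class StreamDemux:
    """Conceptual sketch: outputs from the accelerator are routed into
    one queue per stream_id, so a caller can block for the next output
    of a *specific* stream instead of sorting interleaved frames."""
    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _queue_for(self, stream_id):
        with self._lock:
            return self._queues.setdefault(stream_id, queue.Queue())

    def push(self, stream_id, output):
        # Called by the runtime's receive path for each finished frame.
        self._queue_for(stream_id).put(output)

    def receive_output(self, stream_id, timeout=None):
        # Blocks until an output for this exact stream is available.
        return self._queue_for(stream_id).get(timeout=timeout)
```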
Updated Driver
The driver now supports the “low-BAR M.2” being sampled as part of a 4x M.2 kit to certain customers.
Continued performance improvements in the driver, particularly for high FPS models. Max model FPS can now reach ~40,000 instead of ~15,000.
Advanced users can now increase or decrease the M.2’s power & performance from the default, using the mx_set_powermode script.
Compiler#
New Framework Support
Added support for Keras 3 while maintaining backward compatibility with Keras 2.
Upgraded TensorFlow support to version 2.17.
Enhanced Performance and Support
Expanded operator support, including new and optimized operators.
Improved accuracy for Softmax and exponential operators.
The --effort hard compilation option has been further optimized, offering speedups across a wide range of models.
Some models have seen substantial performance gains, speeding up severalfold.
Faster Compilation
Compilation times have been significantly reduced, with speedups reaching up to 3x for certain models.
Stable API
The compiler APIs have reached a stable state (v1.0.0).
User Interface
Enhanced CLI visualization for a more intuitive user experience.
More detailed error messages providing step-by-step guidance to resolve issues.
Alpha/Beta Releases#
v0.10
Driver Performance Boosts
New output feature map optimization gives better PCIe bandwidth utilization. This boosts FPS for many-output models such as SSDs and newer YOLOs.
Solved bug that caused FPS to plateau around 2500 FPS. Now small models can easily exceed 15,000 FPS on the MX3.
Improved MX3 Firmware
The MX3 firmware now has a 600MHz base frequency for the M.2 (+20%), following extensive testing and characterization by our platform teams.
The MX3 thermal throttling temperature has been increased from 85°C (Tj) to 100°C, again following extensive characterization by our teams.
In the already-rare situations where the MX3 hangs, it will now self-restart instead of requiring a host reboot.
Compiler Effort Control
The NeuralCompiler now has the --effort flag, which allows users to boost FPS substantially by allowing the Compiler more time to try different optimizations. Using --effort hard, users can get much higher FPS vs. the default (--effort normal). Results will vary by model: we have observed from 1.0x to >10.0x.
The use of --effort hard is now strongly recommended before deploying your model into a final product.
But please note that use of hard will greatly increase the amount of time needed to compile your model: around 5x to 15x longer. In the next SDK release, this flag will be optimized to take only a tiny bit longer than normal.
Other Compiler Features
Expanded operator support and FPS boosts for existing models (even on normal).
Improved mx_nc startup time.
More optimized auto-selection of data packing format.
C++ API Performance Improvements
Single-stream scenarios can experience up to a ~30% boost on some host CPUs.
Manual-threading mode has significant performance boosts for multi-stream scenarios.
acclBench Tool
C++ version of the mx_bench Python tool, for better cross-platform support.
The acclBench executable is included with the C++ API package.
See the benchmark page for more info.
Expanded Model Explorer
The Model Explorer has been expanded with new models and downloadable DFPs.
You can now track this page to monitor FPS improvements between SDK updates!
0.10.2
Accl: C++ fix for manual threading mode and ONNX models
Accl: added a memx-accl-noavx option for x86_64 systems that lack AVX2
0.10.1
Compiler: 16-bit weight bugfixes
Accl/Benchmark: Python bugfix for some ARM systems
v0.9
Neural Compiler Additions & Improvements
Logging: Introduced a comprehensive logging system for improved debugging and diagnostics for the Neural Compiler. Optionally, users can submit these logs to MemryX to help improve the Compiler.
Expanded Support: The Compiler has added additional operator support across various frameworks. For a detailed list please visit the operator support page.
Improved User Messaging: Updated compiler messages now feature actionable TODOs to help users address issues more effectively.
Performance Enhancements: Major improvements and bugfixes have significantly increased performance for some models.
C++ API Thread Management
The C++ version of the Multi-Stream Accl API now has support for limiting the number of Stream worker threads.
This feature can improve performance and/or reduce CPU overhead in many-stream applications such as VMS.
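The idea behind capping Stream worker threads can be sketched with a bounded pool (this is a conceptual Python illustration, not the actual C++ Accl API): many streams share a fixed number of workers, trading a little per-stream latency for much lower CPU overhead.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch: instead of one dedicated worker thread per stream,
# a capped pool services many streams -- useful when a VMS opens dozens
# of camera streams but the host has only a few spare cores.
def process_streams(frames_by_stream, infer, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            stream_id: pool.submit(lambda fs: [infer(f) for f in fs], frames)
            for stream_id, frames in frames_by_stream.items()
        }
        for stream_id, fut in futures.items():
            results[stream_id] = fut.result()
    return results
```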
Driver Improvements
Input feature map bandwidth has been greatly optimized. Please note there is still a known issue with output feature map bandwidth that will be addressed in a future update.
Temperature data is now reported over the standard Linux hwmon interface.
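Because hwmon is a standard sysfs interface (temperatures are exposed as integers in millidegrees Celsius), the readings can be consumed with ordinary file I/O. A minimal sketch, assuming the driver registers as a normal hwmon device (the sensor name in the test is illustrative):

```python
from pathlib import Path

def read_hwmon_temps(root="/sys/class/hwmon"):
    """Scan the Linux hwmon sysfs tree and return a mapping of
    {sensor_name/channel: temperature_in_celsius}. hwmon reports
    temperatures as integers in millidegrees Celsius."""
    temps = {}
    for dev in Path(root).glob("hwmon*"):
        name_file = dev / "name"
        name = name_file.read_text().strip() if name_file.exists() else dev.name
        for t in dev.glob("temp*_input"):
            millideg = int(t.read_text().strip())
            temps[f"{name}/{t.stem}"] = millideg / 1000.0
    return temps
```

The same data is also picked up automatically by standard tools such as lm-sensors.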
Model Explorer
Discover and utilize models effortlessly with our new Model Explorer.
The MemryX SDK avoids manual tuning or retraining and is committed to running models out-of-the-box.
All models listed on the Model Explorer are directly from their original sources; no modifications have been made and no model-specific tunings have been done.
v0.8
Deprecated since version v0.8: Direct support for PyTorch 1
PyTorch models can be utilized by exporting them to ONNX (for more information see the tutorial on exporting to ONNX). Direct support for PyTorch 2 will be available at a later date.
MX3 (production) Drivers
The MX3 PCIe driver and the pre-production USB driver are now available for all supported platforms.
The MX3 (Production) PCIe driver for Windows has achieved official signing and WHQL certification, ensuring reliability and compatibility.
MX3 (production) Neural Compiler
Complete support for MX3 (production) chips has been implemented.
We’ve broadened our support for various neural network frameworks. For a comprehensive list of supported operators, please refer to the operator support page.
Introduction of new graph processing techniques to broaden model support and enhance chip performance.
Improved resource mapping for better chip performance, ensuring more efficient utilization of chip capabilities.
The autocropper feature has been improved for better usability.
Mixed-precision features have been ported to the new graph processor (e.g., -bt and --exp_auto_dp).
Enhanced user messaging and Command Line Interface (CLI) information for a better user experience.
Numerous optimizations and fixes have been implemented to improve stability and performance.
Accelerator APIs
The AsyncAccl Python API has been expanded to support binding multiple callbacks for parallel streaming scenarios, such as Video Management System (VMS) applications.
Introducing a new C++ Accl API, designed to replace the previous version. This new API is capable of binding multiple callbacks for parallel streaming scenarios, enhancing support for applications like VMS.
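The multiple-callback pattern can be illustrated with a toy dispatcher (all names below are illustrative, not the real Accl API): each bound input/output callback pair gets its own worker loop, so several video sources are serviced in parallel.

```python
import threading

class MiniAccl:
    """Conceptual sketch of binding one callback pair per stream."""
    def __init__(self, infer):
        self._infer = infer
        self._workers = []

    def connect_stream(self, get_frame, on_result):
        # Each bound (input, output) callback pair runs in its own loop.
        def loop():
            while True:
                frame = get_frame()
                if frame is None:   # source exhausted
                    return
                on_result(self._infer(frame))
        t = threading.Thread(target=loop)
        t.start()
        self._workers.append(t)

    def wait(self):
        for t in self._workers:
            t.join()
```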
0.8.1 - 0.8.4
Compiler: Important bugfix for Softmax
mx_bench: Performance improvements
Compiler: Minor bugfixes for 16-bit quantization
Compiler: Important bugfix for the Invert instruction
v0.7
New Graph Processing Engine
The Frame I/F and Graph Processing layers of the Compiler have been completely rewritten.
This “Graph 2” effort brings with it:
Faster compile times
Wider operator support
More stability and model support
Graph 2 is now the default for Keras, TensorFlow, TFLite, and ONNX frameworks. PyTorch remains on Graph 1 for now.
We recommend exporting PyTorch models to ONNX to use the benefits of Graph 2
Some compiler flags have not yet been ported from Graph 1. The compiler will automatically switch to Graph 1 when these are detected.
Mapper Enhancements
The Mapping layer of the Compiler has seen extensive bug fixes.
Over 1000 models have now been compiled successfully on MX3.
Improved MX3 Production (Cascade+) Performance
The compiler has been greatly improved for the Cascade+ architecture, increasing performance and stability.
Automated Operator Support Page
The operator support page is now automatically generated for each framework.
Option to Exclude Simulator DFP
To decrease the file size of DFPs, the new --no_sim_dfp flag can be used to exclude simulator config info.
The DFP will still work with real MXA hardware, and will be a much smaller file.
Expanded Operator Support
Between the different frameworks, support for many new layers has been added, including:
GELU
FastGELU
Mish
Sqrt
Log
LayerNorm
Attention
MatMul
Depth-to-Space
Transpose
DepthwiseConv2D
Windows Driver/Runtime Early Access
Windows-native versions of the Driver and Accl APIs are now available for MX3 preproduction EVB1/EVB2 boards!
v0.1 to v0.6
Removed so we don’t clutter up this page :-)