Release Notes#

v1.0#

General#

Python 3.11 and 3.12

Important

We have dropped support for Python 3.8. Please see here for help if you cannot install 3.9 or above.

Compiler Support on ARM
  • The NeuralCompiler is now fully supported on ARM devices.

  • No need to compile models on x86 and copy them over!

ARM Device Setup Helper
  • The mx_arm_setup command is now part of the driver package on ARM.

  • Run this command once after install to set up device tree overlays and/or other board-specific tweaks automatically.

  • Currently this script supports:

    • Raspberry Pi 5 (Raspberry Pi OS recommended)

    • Orange Pi 5 Plus

    • Orange Pi 5 Max

    • Radxa Rock 5B

  • We will continue to add more board support in the future.

  • Other boards may already work out-of-the-box. If you are having issues with your ARM platform, please reach out to MemryX for assistance.

DFP v6
  • The DFP file format has been revised from v5 to v6.

  • Although this is mostly an internal feature, it allows us to add new features to the SDK more easily in the future.

  • All libraries and the driver still support v5 files. But the NeuralCompiler will only output v6 from now on.

Runtime#

Multi-device Load Balancing
  • By passing a list of device IDs instead of a single device number, the C++ runtime will automatically run your DFP on multiple M.2 cards and load-balance your streams between them!

  • In other words: 2x M.2 == 2x FPS by changing a single line of code (sketched below).
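
A minimal sketch of the idea in C++, assuming the device list is passed when connecting the DFP (the exact constructor/method signature may differ; check the MxAccl API reference):

```cpp
#include <vector>
#include "memx/accl/MxAccl.h"  // header path may vary per install

int main() {
    MX::Runtime::MxAccl accl;

    // Assumed call shape: a vector of device IDs instead of a single ID.
    // The runtime load-balances connected streams across both cards.
    std::vector<int> device_ids = {0, 1};
    accl.connect_dfp("model.dfp", device_ids);

    // ... connect input/output streams as usual ...
    accl.start();
    accl.wait();
    accl.stop();
    return 0;
}
```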

Automatic Pre/Post Runners (C++)
  • The C++ Accl API now supports automatic execution of cropped pre/post ONNX/TF models.

  • You no longer have to manage your own inference sessions!

  • The AsyncAccl Python API already had similar functionality for Python applications.

  • Please see the CenterNet C++ tutorial for a complete example of its usage; a minimal sketch also follows the note below.

Note

These pre/post functions are provided by the memx-accl-plugins package.
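
A rough sketch of the call flow; connect_pre_model/connect_post_model are assumed method names here (the CenterNet C++ tutorial shows the authoritative usage and signatures):

```cpp
#include "memx/accl/MxAccl.h"  // header path may vary per install

int main() {
    MX::Runtime::MxAccl accl;
    accl.connect_dfp("centernet.dfp");

    // Assumed calls: hand the cropped pre/post ONNX models to the runtime,
    // which then runs them on the host around each MXA inference.
    accl.connect_pre_model("centernet_pre.onnx", /*model_idx=*/0);
    accl.connect_post_model("centernet_post.onnx", /*model_idx=*/0);

    // Streams connected here receive end-to-end (pre -> MXA -> post) results.
    accl.start();
    accl.wait();
    accl.stop();
    return 0;
}
```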

C++ GUI Helper Library
  • The Qt-based GUI toolkit library has been added to the SDK as a convenient alternative to OpenCV's imshow or custom-written Qt code.

C++ Manual Threading Mode Changes
  • The manual-threading mode of MxAccl has been moved to a separate class called MxAcclMT.

  • The receive_output function can now block waiting for an output from a specific stream_id, instead of requiring the user to sort returned frames to streams (see the sketch below).
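
A minimal sketch of the manual-threading flow; the send/receive call shapes below are assumptions, not exact signatures:

```cpp
#include "memx/accl/MxAcclMT.h"  // manual-threading variant; header path may vary

int main() {
    MX::Runtime::MxAcclMT accl;  // previously a mode of MxAccl, now its own class
    accl.connect_dfp("model.dfp");

    // Assumed call shapes (see the API reference for exact signatures):
    //   accl.send_input(frame_data, /*model_id=*/0, /*stream_id=*/7);
    //   accl.receive_output(out_data, /*model_id=*/0, /*stream_id=*/7);
    //
    // receive_output() blocks until an output for stream 7 specifically is
    // ready, so the application no longer sorts returned frames to streams.
    return 0;
}
```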

Updated Driver
  • The driver now supports the “low-BAR M.2” being sampled to certain customers as part of a 4x M.2 kit.

  • Continued performance improvements in the driver, particularly for high-FPS models. Max model FPS can now reach ~40,000 instead of ~15,000.

  • Advanced users can now increase or decrease the M.2’s power & performance from the default, using the mx_set_powermode script.

Compiler#

New Framework Support
  • Added support for Keras 3 while maintaining backward compatibility with Keras 2.

  • Upgraded TensorFlow support to version 2.17.

Enhanced Performance and Support
  • Expanded operator support, including new and optimized operators.

  • Improved accuracy for Softmax and exponential operators.

  • The --effort hard compilation option has been further optimized, offering speedups across a wide range of models.

  • Some models have seen substantial performance gains, with severalfold speedups.

Faster Compilation
  • Compilation times have been significantly reduced, with speedups reaching up to 3x for certain models.

Stable API
  • The compiler APIs have reached a stable state (v1.0.0).

User Interface
  • Enhanced CLI visualization for a more intuitive user experience.

  • More detailed error messages providing step-by-step guidance to resolve issues.

Alpha/Beta Releases#

v0.10
Driver Performance Boosts
  • New output feature map optimization gives better PCIe bandwidth utilization. This boosts FPS for many-output models such as SSDs and newer YOLOs.

  • Fixed a bug that caused FPS to plateau around 2500 FPS. Small models can now easily exceed 15,000 FPS on the MX3.

Improved MX3 Firmware
  • The MX3 firmware now has a 600MHz base frequency for the M.2 (+20%), following extensive testing and characterization by our platform teams.

  • The MX3 thermal throttling temperature has been increased from 85°C (Tj) to 100°C, again following extensive characterization by our teams.

  • In the already-rare situations where the MX3 hangs, it will now restart itself instead of requiring a host reboot.

Compiler Effort Control
  • The NeuralCompiler now has the --effort flag, which lets users boost FPS substantially by giving the Compiler more time to try different optimizations. Using --effort hard, users can get much higher FPS than the default (--effort normal). Results vary by model: we have observed anywhere from 1.0x to over 10x.

  • The use of --effort hard is now strongly recommended before deploying your model into a final product.

  • But please note that --effort hard will greatly increase the time needed to compile your model: around 5x to 15x longer. In the next SDK release, this flag will be optimized to take only slightly longer than normal.

Other Compiler Features
C++ API Performance Improvements
  • Single-stream scenarios can see up to a ~30% boost on some host CPUs.

  • Manual-threading mode has significant performance boosts for multi-stream scenarios.

acclBench Tool
  • A C++ version of the mx_bench Python tool, for better cross-platform support.

  • The acclBench executable is included with the C++ API package.

  • See the benchmark page for more info.

Expanded Model Explorer
  • The Model Explorer has been expanded with new models and downloadable DFPs.

  • You can now track this page to monitor FPS improvements between SDK updates!

0.10.2
  • Accl: C++ fix for manual threading mode and ONNX models

  • Accl: added a memx-accl-noavx option for x86_64 systems that lack AVX2

0.10.1
  • Compiler: 16-bit weight bugfixes

  • Accl/Benchmark: Python bugfix for some ARM systems

v0.9
Neural Compiler Additions & Improvements
  • Logging: Introduced a comprehensive logging system to improve debugging and diagnostics in the Neural Compiler. Optionally, users can submit these logs to MemryX to help improve the Compiler.

  • Expanded Support: The Compiler has added additional operator support across various frameworks. For a detailed list please visit the operator support page.

  • Improved User Messaging: Updated compiler messages now feature actionable TODOs to help users address issues more effectively.

  • Performance Enhancements: Major improvements and bugfixes have significantly increased performance for some models.

C++ API Thread Management
  • The C++ version of the Multi-Stream Accl API now supports limiting the number of Stream worker threads.

  • This feature can improve performance and/or reduce CPU overhead in many-stream applications such as VMS.

Driver Improvements
  • Input feature map bandwidth has been greatly optimized. Please note there is still a known issue with output feature map bandwidth that will be addressed in a future update.

  • Temperature data is now reported over the standard Linux hwmon interface (see the sketch below).
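
Because hwmon is the standard Linux sensor interface, the MX3 temperature can be read like any other sensor. A self-contained C++17 sketch that scans every hwmon node (the MemryX node's exact name string is not listed here, so all nodes are printed):

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    // Each /sys/class/hwmon/hwmonN exposes a "name" and temp*_input files
    // (in millidegrees Celsius). Print the first temperature of each node.
    for (const auto& node : fs::directory_iterator("/sys/class/hwmon")) {
        std::ifstream name_f(node.path() / "name");
        std::ifstream temp_f(node.path() / "temp1_input");
        std::string name;
        long millideg = 0;
        if (name_f >> name && temp_f >> millideg)
            std::cout << name << ": " << millideg / 1000.0 << " C\n";
    }
    return 0;
}
```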

Model Explorer
  • Discover and utilize models effortlessly with our new Model Explorer.

  • The MemryX SDK is committed to running models out-of-the-box, avoiding manual tuning or retraining.

  • All models listed on the Model Explorer are directly from their original sources; no modifications have been made and no model-specific tunings have been done.

v0.8

Deprecated since version v0.8: Direct support for PyTorch 1

PyTorch models can still be used by exporting them to ONNX (for more information, see the tutorial on exporting to ONNX). Direct support for PyTorch 2 will be available at a later date.

MX3 (production) Drivers
  • The MX3 PCIe driver and the pre-production USB driver are now available for all supported platforms.

  • The MX3 (Production) PCIe driver for Windows has achieved official signing and WHQL certification, ensuring reliability and compatibility.

MX3 (production) Neural Compiler
  • Complete support for MX3 (production) chips has been implemented.

  • We’ve broadened our support for various neural network frameworks. For a comprehensive list of supported operators, please refer to the operator support page.

  • Introduction of new graph processing techniques to broaden model support and enhance chip performance.

  • Improved resource mapping for better chip performance, ensuring more efficient utilization of chip capabilities.

  • The autocropper feature has been improved for better usability.

  • Mixed-precision features have been ported to the new graph processor (e.g., -bt and --exp_auto_dp).

  • Enhanced user messaging and Command Line Interface (CLI) information for a better user experience.

  • Numerous optimizations and fixes have been implemented to improve stability and performance.

Accelerator APIs
  • The AsyncAccl Python API has been expanded to support binding multiple callbacks for parallel streaming scenarios, such as Video Management System (VMS) applications.

  • Introducing a new C++ Accl API, designed to replace the previous version. This new API can bind multiple callbacks for parallel streaming scenarios, enhancing support for applications like VMS (see the sketch below).
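
A rough sketch of the multi-stream binding pattern; the connect_stream call shape is an assumption, not the exact signature:

```cpp
#include "memx/accl/MxAccl.h"  // header path may vary per install

int main() {
    MX::Runtime::MxAccl accl;
    accl.connect_dfp("model.dfp");

    // Assumed call shape: bind one input/output callback pair per stream;
    // the runtime interleaves all streams onto the same DFP.
    //   accl.connect_stream(&camera0_in, &camera0_out, /*stream_id=*/0);
    //   accl.connect_stream(&camera1_in, &camera1_out, /*stream_id=*/1);

    accl.start();  // all bound streams run in parallel
    accl.wait();
    accl.stop();
    return 0;
}
```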

0.8.1 - 0.8.4
  • Compiler: Important bugfix for Softmax

  • mx_bench: Performance improvements

  • Compiler: Minor bugfixes for 16-bit quantization

  • Compiler: Important bugfix for the Invert instruction

v0.7
New Graph Processing Engine
  • The Frame I/F and Graph Processing layers of the Compiler have been completely rewritten.

  • This “Graph 2” effort brings with it:
    • Faster compile times

    • Wider operator support

    • More stability and model support

  • Graph 2 is now the default for Keras, TensorFlow, TFLite, and ONNX frameworks. PyTorch remains on Graph 1 for now.
    • We recommend exporting PyTorch models to ONNX to gain the benefits of Graph 2.

  • Some compiler flags have not yet been ported from Graph 1. The compiler will automatically switch to Graph 1 when these are detected.

Mapper Enhancements
  • The Mapping layer of the Compiler has seen extensive bug fixes.

  • Over 1000 models have now been compiled successfully on MX3.

Improved MX3 Production (Cascade+) Performance
  • The compiler has been greatly improved for the Cascade+ architecture, increasing performance and stability.

Automated Operator Support Page
Option to Exclude Simulator DFP
  • To decrease the file size of DFPs, the new --no_sim_dfp flag can be used to exclude simulator config info.

  • The DFP will still work with real MXA hardware, and will be a much smaller file.

Expanded Operator Support
  • Between the different frameworks, support for many new layers has been added, including:
    • GELU

    • FastGELU

    • Mish

    • Sqrt

    • Log

    • LayerNorm

    • Attention

    • MatMul

    • Depth-to-Space

    • TransposeDepthwiseConv2D

Windows Driver/Runtime Early Access
  • Windows-native versions of the Driver and Accl APIs are now available for MX3 preproduction EVB1/EVB2 boards!

v0.1 to v0.6

Removed so we don’t clutter up this page :-)