Closed Loop Compilation#

Note

This tutorial assumes that an M.2 MX3 is connected to the same machine where you are running the script.

Introduction#

By default, our compiler balances compilation speed against runtime performance. However, for users who want the best possible performance, we provide the --effort hard option, which lets the compiler spend additional time searching for the best-performing solution.

This tutorial explains the concept of “closed-loop effort=hard,” an advanced feature that further refines model performance through real FPS measurements, iterating over different configurations to find the optimal one.

Effort Hard Mode#

The --effort hard option enables the compiler to generate multiple candidate mapping points and select the most efficient one based on a cost function. This approach differs from the default compilation mode, which relies on heuristics for selecting a single mapping point. While --effort hard increases compilation time, it often results in better performance.
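As a toy illustration of the idea (not the compiler's internals), effort hard generates several candidate mapping points and keeps the one that minimizes a cost function. The candidate names, fields, and weights below are invented for demonstration only.

```python
# Invented candidates: each mapping point has an estimated latency and
# a chip-utilization estimate. Real compiler cost models are more complex.
candidates = [
    {"name": "map_a", "est_latency_ms": 2.4, "utilization": 0.71},
    {"name": "map_b", "est_latency_ms": 2.1, "utilization": 0.68},
    {"name": "map_c", "est_latency_ms": 2.6, "utilization": 0.80},
]

def cost(candidate):
    # Lower latency and higher utilization are better; the 0.5 weight
    # is arbitrary and only serves to combine the two terms.
    return candidate["est_latency_ms"] - 0.5 * candidate["utilization"]

best = min(candidates, key=cost)
print(best["name"])
```

The default compilation mode skips this search and picks a single mapping point heuristically, which is why it compiles faster but may leave some performance on the table.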

Introducing: Closed-Loop Compilation#

Closed-loop compilation enhances effort hard by incorporating real FPS benchmarks measured directly on the chip. By iterating through potential mapping points (called “splits”) and selecting the configuration with the highest FPS, closed-loop compilation can improve performance by an average of 3%. Some models may experience even greater gains.

This feature is not yet fully integrated into the compiler but can be executed manually using a provided script.

Hint

“Splits” refer to different sets of NeuralCompiler options and mapping points. They are not related to model cropping and do not modify the layers being compiled.

Step 1: Prepare Your Model#

Before running the closed-loop script, ensure you have a model ready. For demonstration purposes, we will use a YOLOv5-small-voc model, which can be obtained from a public source using the following command:

wget https://mmdeploy-oss.openmmlab.com/model/mmyolo/yolov5-660fed.onnx

Step 2: Run the Closed-Loop Script#

The closed-loop script automates the compilation and benchmarking process. It compiles the model at different splits and selects the best-performing executable. Run the script with the model file as an argument:

python3 closed_loop.py --models yolov5-660fed.onnx

The script performs the following steps:

  1. Compiles all unique splits generated by the compiler in effort hard mode to DFPs.

  2. Benchmarks each DFP on the connected MX3 hardware to measure FPS.

  3. Identifies and selects the split with the highest FPS.

  4. Copies the best-performing DFP file to best_split.dfp.

After execution, the script will output FPS results for each tested split. The split with the highest FPS is automatically selected and saved as best_split.dfp.
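The selection logic of the steps above can be sketched as follows. This is an assumed simplification, not the script's actual source: `benchmark_fps` is a placeholder for the real on-chip measurement, and here it just returns hard-coded example numbers.

```python
from pathlib import Path

def benchmark_fps(dfp_path):
    # Placeholder: on real hardware this would run the DFP on the MX3
    # for N frames and return the measured frames per second.
    fake_results = {"split_0.dfp": 412.5, "split_1.dfp": 431.8, "split_2.dfp": 425.1}
    return fake_results[Path(dfp_path).name]

def select_best_split(dfp_paths):
    """Benchmark every compiled split and return the path with the highest FPS."""
    scores = {path: benchmark_fps(path) for path in dfp_paths}
    best = max(scores, key=scores.get)
    return best, scores[best]

best, fps = select_best_split(["split_0.dfp", "split_1.dfp", "split_2.dfp"])
# The winning DFP would then be copied to best_split.dfp, e.g. with
# shutil.copy(best, "best_split.dfp").
print(best, fps)
```

In the real script the compile and benchmark stages run per split, so adding more candidate splits increases total runtime roughly linearly.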

Full Options List#

-m MODEL [MODEL ...], --models MODEL [MODEL ...]
    Path(s) to the model file(s).

-j N, --jobs N
    Maximum number of concurrent compilation jobs (default: 4).

-k, --keep-split-folders
    Keep all compiled splits (default: False).

-pm MODEL_ID, --prioritize-model MODEL_ID
    When benchmarking splits for multi-model DFPs, whether to prioritize total FPS (-1, default) or a specific model only (0 for the first model, 1 for the second, etc.).

-f N, --frames N
    Number of frames to run for each benchmark (default: 500).

-c N, --num_chips N
    Size (number of chips) of the target MXA device (default: 4).

-g N, --device_id N
    ID of the connected MXA device to use for compilation and benchmarking (default: 0).
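The options above can be mirrored with a minimal argparse sketch (assumed, not the script's actual source), which makes the defaults easy to check programmatically:

```python
import argparse

# Hypothetical reconstruction of the closed_loop.py argument parser,
# based only on the options list documented above.
parser = argparse.ArgumentParser(description="Closed-loop compilation")
parser.add_argument("-m", "--models", nargs="+", required=True,
                    help="Path(s) to the model file(s).")
parser.add_argument("-j", "--jobs", type=int, default=4,
                    help="Maximum number of concurrent compilation jobs.")
parser.add_argument("-k", "--keep-split-folders", action="store_true",
                    help="Keep all compiled splits.")
parser.add_argument("-pm", "--prioritize-model", type=int, default=-1,
                    help="-1 for total FPS, or a model index for multi-model DFPs.")
parser.add_argument("-f", "--frames", type=int, default=500,
                    help="Number of frames to run for each benchmark.")
parser.add_argument("-c", "--num_chips", type=int, default=4,
                    help="Size (number of chips) of the target MXA device.")
parser.add_argument("-g", "--device_id", type=int, default=0,
                    help="ID of the connected MXA device.")

# Parse the same invocation used earlier in this tutorial.
args = parser.parse_args(["--models", "yolov5-660fed.onnx"])
print(args.models, args.jobs, args.frames)
```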

Third-Party Licenses#

This tutorial utilizes a third-party model. Below are the license details for this dependency:

Summary#

Closed-loop compilation provides a powerful way to optimize neural network execution on the M.2 MX3. By benchmarking and selecting the best split configuration, users can achieve higher FPS with minimal manual effort. Future enhancements, including multi-process compilation and improved SDK integration, will further accelerate the process.

The full script is available for download: