MultiStream YOLO26 Object Detection#

Introduction#

In this tutorial, we will show how to use the MxAccl Python API and C++ API to perform real-time object detection on multiple input streams. We will use the YOLO26-nano model for our demo.

Hint

This tutorial showcases SDK 2.2’s new Python API, which brings large performance improvements and a unified interface with C++.

See the YOLOv8S tutorial for a similar walkthrough with the old Python API.

Download and Run

Download

This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.

Run

Requirements

Install the latest compatible combination of packages:

pip install --extra-index-url https://developer.memryx.com/pip memryx ultralytics 'opencv-python~=4.11.0'

Run

python run_yolo26n_multistream_objectdetection.py

Requirements

sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui

Run

cd src/c++/
mkdir build
cd build
cmake ..
make
./multistream_objectdetection

1. Download the Model#

The YOLO26 pre-trained models are available on the official YOLO26 GitHub page. For this tutorial, we have already exported the model for you to download; it can be found in the compressed folder multistream_od_yolo26.zip attached to this tutorial.

Steps are for explanation and learning

These step-by-step snippets are provided to explain the process and help you understand the concepts. For a complete, runnable version, please use the full scripts from the “Download & Run” section above.

2. Compile the Model#

The YOLO26-nano model was exported with the option to include a post-processing section in the model graph, so it must be compiled with the Neural Compiler's autocrop option. The compiler then generates a DFP file for the main section of the model (YOLO26_nano_640_640_3_onnx.dfp) and an ONNX file for the cropped post-processing section (YOLO26_nano_640_640_3_onnx_post.onnx). The compilation step is typically needed only once and can be done using the Neural Compiler API or tool.

Hint

You can use the pre-compiled DFP and post-processing section from multistream_od_yolo26.zip attached to this tutorial and skip the compilation step.

from memryx import NeuralCompiler  
nc = NeuralCompiler(num_chips=4, models="YOLO26_nano_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()

In your command line, run:

mx_nc -v -m YOLO26_nano_640_640_3_onnx.onnx --autocrop -c 4

In your Python code, point to the DFP via the generated file path:

dfp = "YOLO26_nano_640_640_3_onnx.dfp"

In your C++ code, point to the DFP via the generated file path:

fs::path model_path = "YOLO26_nano_640_640_3_onnx.dfp";

3. CV Pipelines#

In this tutorial, we will show end-to-end implementations for the CV graph. Here, the overlay and the output display are part of the output function connected through the MxAccl API. The following flowchart shows the different parts of the pipeline. Note that the input camera frame is saved (queued) so it can later be overlaid and displayed.

graph LR
    input([Input Function]) --> accl[Accelerator]
    accl --> output([Output Function])
    input -.-> q[[Frames Queue]]
    q -.-> output
    style input fill:#CFE8FD, stroke:#595959
    style accl fill:#FFE699, stroke:#595959
    style output fill:#A9D18E, stroke:#595959
    style q fill:#dbd9d3, stroke:#595959
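The frames-queue pattern in the flowchart can be sketched in isolation with Python's thread-safe queue.Queue. This is a minimal illustrative sketch, not the tutorial's actual class; the function names are hypothetical:

```python
from queue import Queue

# Illustrative frame queue: the input function enqueues each raw frame
# before handing the preprocessed copy to the accelerator; the output
# function later dequeues the matching raw frame to overlay detections.
# FIFO order on both sides keeps frames and detections paired up.
frames_queue = Queue(maxsize=4)

def input_fn(frame):
    frames_queue.put(frame)   # keep the original frame for display later
    return frame              # (preprocessing would happen here)

def output_fn(detections):
    frame = frames_queue.get()      # matching original frame, FIFO order
    return frame, detections        # (overlay/display would happen here)
```

The bounded `maxsize` mirrors the real pipeline, where a full queue is the signal to drop camera frames rather than fall behind real time.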

4. CV Initializations#

We will import the needed libraries, initialize the CV pipeline, and define common variables in this step.

import time
import argparse
import numpy as np
import cv2
from queue import Queue, Full
from threading import Thread
from matplotlib import pyplot as plt
from memryx.mxapi import MxAccl
# Stream-related containers
self.streams = []
self.streams_idx = [True] * self.num_streams
self.stream_window = [False] * self.num_streams
self.cap_queue = {i: Queue(maxsize=4) for i in range(self.num_streams)}
self.dets_queue = {i: Queue(maxsize=5) for i in range(self.num_streams)}
self.outputs = {i: [] for i in range(self.num_streams)}
# Initialize video captures, models, and dimensions for each stream
for i, video_path in enumerate(video_paths):
    if "/dev/video" in video_path:
        self.srcs_are_cams[i] = True
    else:
        self.srcs_are_cams[i] = False

    vidcap = cv2.VideoCapture(video_path)
    self.streams.append(vidcap)

    self.dims[i] = (int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                    int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    self.color_wheel[i] = np.random.randint(0, 255, (20, 3)).astype(np.int32)

    # Initialize the model with the stream dimensions
    self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))
#include <opencv2/opencv.hpp>    /* imshow */
#include <opencv2/imgproc.hpp>   /* cvtcolor */
#include <opencv2/imgcodecs.hpp> /* imwrite */
//Application Variables
std::deque<cv::Mat> frames_queue;
std::mutex frame_queue_mutex;
std::chrono::milliseconds start_ms;

5. Model Pre-/Post-Processing#

The pre-/post-processing steps are typically provided by the model authors and are outside the scope of this tutorial. We provide a helper class in the tutorial's compressed folder that implements the pre- and post-processing for YOLO26, which you can check for reference. Use the helper class as follows:

from yolov26 import YoloV26 as YoloModel
# Initialize the model with the stream dimensions
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))

The YoloModel.preprocess function resizes and letterboxes the input frame and scales it to [0, 1], as required by YOLO:

def preprocess(self, img):
    h0, w0 = img.shape[:2]
    self.orig_shape = (h0, w0)

    r = self.input_size[0] / max(h0, w0)
    if r != 1:
        interp = cv2.INTER_AREA if r < 1 else cv2.INTER_LINEAR
        img = cv2.resize(img, (int(w0 * r), int(h0 * r)), interpolation=interp)

    img, _, dwdh = self._letterbox(img, new_shape=self.input_size, auto=False)

    img = img.astype(np.float32) / 255.0

    # Always store these because postprocess needs them
    self.ratio = r
    self.pad = dwdh

    # add Z singleton dimension as expected by MxAccl: HWC -> HWZC
    img = np.expand_dims(img, axis=2)

    return img
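The _letterbox helper used above pads the resized image to the square model input while recording the padding offsets that postprocess later subtracts from the box coordinates. A minimal NumPy-only sketch, assuming centered padding with the conventional YOLO gray value of 114 (both assumptions, not taken from the helper class):

```python
import numpy as np

def letterbox(img, new_shape=(640, 640), pad_value=114):
    """Minimal sketch: center-pad an HWC image up to new_shape.

    Returns the padded image and the (dw, dh) padding offsets that a
    postprocess step needs to undo before rescaling boxes. Assumes the
    image already fits inside new_shape (resize happened beforehand).
    """
    h, w = img.shape[:2]
    dh = (new_shape[0] - h) / 2          # vertical padding per side
    dw = (new_shape[1] - w) / 2          # horizontal padding per side
    top, bottom = int(np.floor(dh)), int(np.ceil(dh))
    left, right = int(np.floor(dw)), int(np.ceil(dw))
    padded = np.pad(img, ((top, bottom), (left, right), (0, 0)),
                    constant_values=pad_value)
    return padded, (dw, dh)
```

Because the padding is split evenly, a 640x480 frame resized to 640x480 would be padded by 80 pixels on top and bottom, and the stored (dw, dh) = (0, 80) lets postprocess map box coordinates back to the original frame.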

By passing the cropped model section to accl.connect_post_model() later, the output from the chip is automatically run through the cropped section before being returned to the output callback function.

Then, the YoloModel.postprocess function translates the raw output into bounding boxes, with coordinates adjusted to the original input image resolution:

def postprocess(self, fmap):
    post_output = fmap[0]

    if len(post_output) == 0:
        return []

    detections = post_output[0]
    dets = []

    h0, w0 = self.orig_shape

    for detection in detections:
        x1, y1, x2, y2, confidence, class_id = detection

        if confidence < 0.4:
            continue

        # Remove padding, then scale back to original image
        x1 = (x1 - self.pad[0]) / self.ratio
        x2 = (x2 - self.pad[0]) / self.ratio
        y1 = (y1 - self.pad[1]) / self.ratio
        y2 = (y2 - self.pad[1]) / self.ratio

        # Clip to image bounds
        x1 = max(0, min(x1, w0 - 1))
        x2 = max(0, min(x2, w0 - 1))
        y1 = max(0, min(y1, h0 - 1))
        y2 = max(0, min(y2, h0 - 1))

        if x2 <= x1 or y2 <= y1:
            continue

        dets.append({
            'bbox': (int(x1), int(y1), int(x2), int(y2)),
            'class': COCO_CLASSES[int(class_id)],
            'class_idx': int(class_id),
            'score': float(confidence)
        })

    return dets
cv::Mat preprocess(cv::Mat& image) {

    cv::Mat resizedImage;
    cv::resize(image, resizedImage, cv::Size(model_input_width, model_input_height));

    // Convert image to float32 and normalize
    cv::Mat floatImage;
    resizedImage.convertTo(floatImage, CV_32F, 1.0 / 255.0);

    return floatImage;
}

void draw_bounding_box(cv::Mat& image, std::vector<detectedObj>& detections_vector) {
    for (int i = 0; i < detections_vector.size(); ++i) {
        detectedObj detected_object = detections_vector[i];
        cv::rectangle(image, cv::Point(detected_object.x1, detected_object.y1), cv::Point(detected_object.x2, detected_object.y2), cv::Scalar(0, 255, 0), 2);

        cv::putText(image, class_names.at(detected_object.obj_id),
            cv::Point(detected_object.x1, detected_object.y1 - 3), cv::FONT_ITALIC,
            0.8, cv::Scalar(255, 255, 255), 2);

        cv::putText(image, std::to_string(detected_object.accuracy),
            cv::Point(detected_object.x1, detected_object.y1 + 30), cv::FONT_ITALIC,
            0.8, cv::Scalar(255, 255, 0), 2);
    }
}

std::vector<detectedObj> get_detections(float* output, int num_boxes) {
    std::vector<detectedObj> detections;

    // YOLOv26 output format: (1, 300, 6)
    // [x1, y1, x2, y2, confidence, class_id]
    for (int i = 0; i < num_boxes; i++) {
        float x1 = output[i * 6 + 0];
        float y1 = output[i * 6 + 1];
        float x2 = output[i * 6 + 2];
        float y2 = output[i * 6 + 3];
        float accuracy = output[i * 6 + 4];
        int classPrediction = static_cast<int>(output[i * 6 + 5]);

        if (accuracy < conf_thresh) {
            continue;
        }

        // Skip clearly invalid boxes
        if (x2 <= x1 || y2 <= y1) {
            continue;
        }

        // Scale from model input space to original image space
        x1 = (x1 / model_input_width) * input_image_width;
        x2 = (x2 / model_input_width) * input_image_width;
        y1 = (y1 / model_input_height) * input_image_height;
        y2 = (y2 / model_input_height) * input_image_height;

        // Reject boxes completely outside the image
        if (x2 <= 0 || y2 <= 0 || x1 >= input_image_width || y1 >= input_image_height) {
            continue;
        }

        // Clip boxes to image boundaries
        x1 = std::max(0.0f, std::min(x1, static_cast<float>(input_image_width - 1)));
        y1 = std::max(0.0f, std::min(y1, static_cast<float>(input_image_height - 1)));
        x2 = std::max(0.0f, std::min(x2, static_cast<float>(input_image_width - 1)));
        y2 = std::max(0.0f, std::min(y2, static_cast<float>(input_image_height - 1)));

        // Reject boxes that became invalid after clipping
        if (x2 <= x1 || y2 <= y1) {
            continue;
        }

        detectedObj obj(x1, x2, y1, y2, classPrediction, accuracy);
        detections.push_back(obj);
    }

    return detections;
}

The call to accl->connect_post_model() will automatically retrieve the output from the chip, run it through the cropped post-processing section of the graph using ONNX Runtime, and generate the final output.

// Connecting the post-processing model obtained from the autocrop of the neural compiler to get the final output.
// The second parameter is required because the output shape of this particular post-processing model is variable,
// and accl needs to know the maximum possible output size. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});

In this case, we use ONNX Runtime, since the cropped section is in the ONNX format. Users can also write their own post-processing if ONNX Runtime is not available on their system.
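For reference, the (1, 300, 6) post-model output described above, where each row is [x1, y1, x2, y2, confidence, class_id], can be filtered in a few lines of NumPy. This is an illustrative sketch with an assumed confidence threshold, not the tutorial's helper code:

```python
import numpy as np

def decode_detections(post_output, conf_thresh=0.4):
    """Sketch: filter a (1, 300, 6) YOLO26 post-model output array.

    Rows below the confidence threshold and degenerate boxes
    (x2 <= x1 or y2 <= y1) are dropped, mirroring the checks in the
    tutorial's postprocess functions.
    """
    dets = post_output[0]                       # -> (300, 6)
    keep = dets[:, 4] >= conf_thresh            # confidence filter
    keep &= (dets[:, 2] > dets[:, 0]) & (dets[:, 3] > dets[:, 1])
    return [
        {"bbox": tuple(row[:4]),
         "score": float(row[4]),
         "class_idx": int(row[5])}
        for row in dets[keep]
    ]
```

A full implementation would additionally undo the letterbox padding and resize ratio, and clip boxes to the image bounds, as the tutorial's postprocess does.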

6. Define an Input Function#

We need to define an input function for the accelerator to use. In this case, our input function gets a new frame from the camera and pre-processes it.

def capture_and_preprocess(self, stream_id):
    while True:
        got_frame, frame = self.streams[stream_id].read()

        if not got_frame or self.done:
            self.streams_idx[stream_id] = False
            return None

        if self.srcs_are_cams[stream_id] and self.cap_queue[stream_id].full():
            # drop frame
            continue
        else:
            try:
                # Put the frame in the cap_queue to be processed later
                self.cap_queue[stream_id].put(frame, timeout=2)

                # OpenCV is BGR, need to convert to RGB before feeding into the model
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

                # Pre-process the frame using the corresponding model
                frame_rgb = self.model[stream_id].preprocess(frame_rgb)
                return frame_rgb

            except Full:
                print('Dropped frame')
                continue
bool incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {

    if (runflag.load()) {
        cv::Mat inframe;
        cv::Mat rgbImage;

        while (true) {
            bool got_frame = vcap.read(inframe);

            if (!got_frame) {
                std::cout << "No frame \n\n\n";
                vcap.release();
                return false;  // return false if frame retrieval fails/stream is done sending input
            }
            if (src_is_cam && (frames_queue.size() >= FRAME_QUEUE_MAX_LENGTH)) {
                // drop the frame and try again if we've hit the limit
                continue;
            }
            else {
                cv::cvtColor(inframe, rgbImage, cv::COLOR_BGR2RGB);
                {
                    std::lock_guard<std::mutex> ilock(frame_queue_mutex);
                    frames_queue.push_back(rgbImage);
                }
            }

            // Preprocess frame
            cv::Mat preProcframe = preprocess(rgbImage);
            // Set preprocessed input data to be sent to the accelerator
            dst[0]->set_data((float*)preProcframe.data);

            return true;
        }
    }
    else {
        vcap.release();
        return false;// Manually stopped the application so returning false to stop input.
    }
}

Note

In the code above, the preprocess method handles the pre-processing step. It can be found in the full code file.

7. Define Output Functions#

We also need to define an output function for the accelerator to use. Our output function will post-process the accelerator's output and display it on the screen.

The output function will collect the outputs and postprocess them:

def postprocess(self, mxa_output, stream_id):
    dets = self.model[stream_id].postprocess(mxa_output)

    # Push the detection results to the queue
    self.dets_queue[stream_id].put(dets)

    # Calculate the FPS
    self.dt_array[stream_id][self.dt_index[stream_id]] = time.time() - self.frame_end_time[stream_id]
    self.dt_index[stream_id] += 1

    if self.dt_index[stream_id] % 15 == 0:
        self.fps[stream_id] = 1 / np.average(self.dt_array[stream_id])

    if self.dt_index[stream_id] >= 30:
        self.dt_index[stream_id] = 0

    self.frame_end_time[stream_id] = time.time()
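The FPS bookkeeping above keeps a rolling window of per-frame time deltas and refreshes the average periodically. The same idea in isolation, as a standalone sketch with the tutorial's window sizes (30-sample window, refresh every 15 frames):

```python
import numpy as np

class RollingFPS:
    """Sketch of the rolling-average FPS used above: store the last
    `window` frame deltas and refresh the average every `refresh` frames."""

    def __init__(self, window=30, refresh=15):
        self.dt = np.zeros(window)   # ring buffer of frame deltas (seconds)
        self.i = 0                   # write index into the ring buffer
        self.window, self.refresh = window, refresh
        self.fps = 0.0

    def tick(self, dt):
        self.dt[self.i] = dt
        self.i += 1
        if self.i % self.refresh == 0:
            self.fps = 1.0 / np.average(self.dt)
        if self.i >= self.window:
            self.i = 0               # wrap around, overwriting oldest deltas
        return self.fps
```

Averaging over a window rather than using the instantaneous delta keeps the on-screen FPS number stable; note that, like the tutorial code, the average is slightly optimistic until the buffer has filled once.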

Then the drawing thread will pop these results and overlay them on the final image:

def display(self):
    while not self.done:
        # Iterate over each stream to handle multiple displays
        for stream_idx in range(self.num_streams):
            # Check if the queues for frames and detections have data
            if not self.cap_queue[stream_idx].empty() and not self.dets_queue[stream_idx].empty():
                frame = self.cap_queue[stream_idx].get()
                dets = self.dets_queue[stream_idx].get()

                self.cap_queue[stream_idx].task_done()
                self.dets_queue[stream_idx].task_done()

                # Draw detection boxes on the frame
                for d in dets:
                    l, t, r, b = d['bbox']
                    color = tuple(int(c) for c in self.color_wheel[stream_idx][d['class_idx'] % 20])
                    frame = cv2.rectangle(frame, (l, t), (r, b), color, 2)
                    frame = cv2.rectangle(frame, (l, t - 18), (r, t), color, -1)
                    frame = cv2.putText(frame, d['class'], (l + 2, t - 5),
                                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

                # Add FPS information to the frame
                fps_text = f"{self.model[stream_idx].name} - {self.fps[stream_idx]:.1f} FPS" if self.fps[stream_idx] > 1 else self.model[stream_idx].name
                frame = cv2.putText(frame, fps_text, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)

                # Show the frame in a unique window for each stream
                if self.show:
                    window_name = f"Stream {stream_idx} - YOLOv26"
                    cv2.imshow(window_name, frame)

        # Exit on key press (applies to all streams)
        if cv2.waitKey(1) == ord('q'):
            self.done = True

    # When done, destroy all windows and release resources
    cv2.destroyAllWindows()
    for stream in self.streams:
        stream.release()

Besides collecting and post-processing the MXA data, the output function also overlays and displays the output frame.

bool outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {

    //Output from the post-processing model is a vector of size 1,
    //so we copy only the first featuremap
    if (mxa_output == nullptr || src[0] == nullptr) // for cleaner shutdowns
        return false;
    src[0]->get_data(mxa_output);

    {
        std::lock_guard<std::mutex> ilock(frame_queue_mutex);
        // pop from frame queue
        displayImage = frames_queue.front();
        frames_queue.pop_front();
    }// releases in frame queue lock

    //Get the detections from model output
    std::vector<detectedObj> detected_objectVector = get_detections(mxa_output, num_boxes);

    // draw bounding boxes
    draw_bounding_box(displayImage, detected_objectVector);

    // using mx QT util to update the display frame
    gui_->screens[0]->SetDisplayFrame(streamLabel, &displayImage, fps_number);

    //Calculate FPS once every AVG_FPS_CALC_FRAME_COUNT frames
    frame_count++;
    if (frame_count == 1)
    {
        start_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
    }
    else if (frame_count % AVG_FPS_CALC_FRAME_COUNT == 0)
    {
        std::chrono::milliseconds duration =
            std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start_ms;
        fps_number = (float)AVG_FPS_CALC_FRAME_COUNT * 1000 / (float)(duration.count());
        frame_count = 0;
    }
    return true;
}

8. Connect the Accelerator#

The run function creates the accelerator, starts it with the specified number of streams, and waits for it to finish.

def run(self):
    print("dfp path = ", self.dfp_path)
    # Optimize performance by keeping inputs in opencv's HWC order
    # by setting use_model_shape to [False (inputs), True (outputs)].
    # Also, use local_mode=True for faster performance.
    accl = MxAccl(self.dfp_path, [0], [False,True], True)
    accl.connect_post_model(self.postmodel_path)
    print("YOLOv26 inference on MX3 started")

    self.display_thread.start()

    start_time = time.time()

    # Connect the input and output functions and let the accl run
    for i in range(self.num_streams):
        accl.connect_stream(self.capture_and_preprocess, self.postprocess, stream_id=i)

    accl.start()
    accl.wait()

    self.done = True

    # Join the display thread
    self.display_thread.join()

The main() function creates the accelerator and the YoloV26 objects, starts the accelerator, and waits for it to finish.

int main(int argc, char* argv[]) {
    signal(SIGINT, signal_handler);
    vector<string> video_src_list;

    std::string video_str = "cam:0";

    /* [iterate through the arguments and set variables here] */

    // if video_paths arg isn't passed - use default video string.
    if (video_src_list.size() == 0) {
        video_src_list.push_back(video_str);
    }

    // Initialize the MemryX accelerator, with use_model_shape [false,true] to avoid needing NHWC/NCHW input transpose
    MX::Runtime::MxAccl accl{ fs::path(model_path), {0}, {false,true} };

    // Connecting the post-processing model obtained from the autocrop of the neural compiler to get the final output.
    // The second parameter is required because the output shape of this particular post-processing model is variable,
    // and accl needs to know the maximum possible output size. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
    accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});

    // Creating GuiView which is a memryx qt util for easy display
    MxQt gui(argc, argv);
    // Setting the layout of the display based on number of input streams. Full screen mode only when more than one stream
    if (video_src_list.size() == 1)
        gui.screens[0]->SetSquareLayout(1, false);
    else
        gui.screens[0]->SetSquareLayout(static_cast<int>(video_src_list.size()));

    //Creating a YoloV26 object for each stream which also connects the corresponding stream to accl.
    std::vector<YoloV26*>yolo_objs;
    for (int i = 0; i < video_src_list.size(); ++i) {
        YoloV26* obj = new YoloV26(&accl, video_src_list[i], &gui, i);
        yolo_objs.push_back(obj);
    }

    //Run the accelerator and wait
    accl.start();
    gui.Run(); //This command waits for exit to be pressed in Qt window
    accl.stop();

    //Cleanup
    for (int i = 0; i < video_src_list.size(); ++i) {
        delete yolo_objs[i];
    }
}

The YoloV26() constructor connects the input stream to the accelerator.

YoloV26(MX::Runtime::MxAccl* accl, std::string video_src, MxQt* gui, int index) {
    //Assign the gui pointer to a class-specific member
    gui_ = gui;
    // If the input is a camera, try to use optimal settings
    if (video_src.substr(0, 3) == "cam") {
        int device = std::stoi(video_src.substr(4));
        src_is_cam = true;
      #ifdef __linux__
        if (!openCamera(vcap, device, cv::CAP_V4L2)) {
            throw(std::runtime_error("Failed to open: " + video_src));
        }

      #elif defined(_WIN32)
        if (!openCamera(vcap, device, cv::CAP_ANY)) {
            throw(std::runtime_error("Failed to open: " + video_src));
        }
      #endif
    }
    else if (video_src.substr(0, 3) == "vid") {
        std::cout << "Video source given = " << video_src.substr(4) << "\n\n";
        vcap.open(video_src.substr(4), cv::CAP_ANY);
        src_is_cam = false;
    }
    else {
        throw(std::runtime_error("Given video src: " + video_src + " is invalid" +
            "\n\n\tUse ./objectDetection cam:<camera index>,vid:<path to video file>,cam:<camera index>,vid:<path to video file>\n\n"));
    }
    if (!vcap.isOpened()) {
        std::cout << "videocapture for " << video_src << " is NOT opened, \n try giving full absolute paths for video files and the correct camera index for cameras \n";
        runflag.store(false);
    }

    // Getting input image dimensions
    input_image_width = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_WIDTH));
    input_image_height = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_HEIGHT));

    model_info = accl->get_model_info(0);//Getting model info of 0th model which is the only model in this DFP
    mxa_output = new float[num_boxes * 6];//Allocate the output buffer (max_boxes x num_box_parameters) - YOLO26 has 6 params per box

    //Getting model input shapes and display size
    model_input_height = model_info.in_featuremap_shapes[0][0];
    model_input_width = model_info.in_featuremap_shapes[0][1];

    //Connecting the stream to the accl object. As the callback functions are defined as part of the class
    //YoloV26 we should bind them with the possible input parameters
    auto in_cb = std::bind(&YoloV26::incallback_getframe, this, std::placeholders::_1, std::placeholders::_2);
    auto out_cb = std::bind(&YoloV26::outcallback_getmxaoutput, this, std::placeholders::_1, std::placeholders::_2);
    accl->connect_stream(in_cb, out_cb, index/**Unique Stream Idx */, 0/**Model Idx */);

    //Enable the callbacks; they run once the accelerator is started
    runflag.store(true);
}

Third-Party Licenses#

This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:

Summary#

This tutorial showed how to use the MxAccl API to run real-time multistream inference with an object-detection model.
