MultiStream YOLO26 Object Detection#
Introduction#
In this tutorial, we will show how to use the MxAccl Python API and C++ API to perform real-time object detection on multiple input streams. We will use the YOLO26-nano model for our demo.
Hint
This tutorial showcases SDK 2.2’s new Python API, which brings large performance improvements and a unified interface with C++.
See the YOLOv8S tutorial for a similar walkthrough with the old Python API.
Download and Run
Download
This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.
Run
Requirements
Install the latest compatible versions of the required packages:
pip install --extra-index-url https://developer.memryx.com/pip memryx ultralytics 'opencv-python~=4.11.0'
Run
python run_yolo26n_multistream_objectdetection.py
Requirements
sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui
Run
cd src/c++/
mkdir build
cd build
cmake ..
make
./multistream_objectdetection
1. Download the Model#
The YOLO26 pre-trained models are available on the official YOLO26 GitHub page. For this tutorial, we have already exported the model for you. It can be found in the compressed folder multistream_od_yolo26.zip attached to this tutorial.
Steps are for explanation and learning
These step-by-step snippets are provided to explain the process and help you understand the concepts. For a complete, runnable version, please use the full scripts from the “Download & Run” section above.
2. Compile the Model#
The YOLO26-nano model was exported with the option to include a post-processing section in the model graph. Hence, it needs to be compiled with the Neural Compiler autocrop option. After compilation, the compiler generates the DFP file for the main section of the model (YOLO26_nano_640_640_3_onnx.dfp) and an ONNX file for the cropped post-processing section (YOLO26_nano_640_640_3_onnx_post.onnx). The compilation step is typically needed only once and can be done using the Neural Compiler API or Tool.
Hint
You can use the pre-compiled DFP and post-processing section from multistream_od_yolo26.zip attached to this tutorial and skip the compilation step.
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO26_nano_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()
On the command line, run:
mx_nc -v -m YOLO26_nano_640_640_3_onnx.onnx --autocrop -c 4
In your Python code, point to the generated DFP file path:
dfp = "YOLO26_nano_640_640_3_onnx.dfp"
In your C++ code, point to the generated DFP file path:
fs::path model_path = "YOLO26_nano_640_640_3_onnx.dfp";
3. CV Pipelines#
In this tutorial, we show end-to-end implementations of the CV graph. Here, the overlay and output display are part of the output function connected to the MultiStreamAsyncAccl API. The following flowchart shows the different parts of the pipeline. Note that each input camera frame must be saved (queued) so that it can later be overlaid and displayed.
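As a minimal illustration of this queued-frame pattern, the input side can push each captured frame into a bounded queue that the display side later drains. This is a standalone sketch (not the tutorial package code); `enqueue_frame` is a hypothetical helper name:

```python
from queue import Queue, Full

# Bounded queue for one stream: holds raw frames until the display
# thread pairs them with detection results.
cap_queue = Queue(maxsize=4)

def enqueue_frame(frame, is_camera):
    """Queue a frame; for live cameras, drop it instead of blocking."""
    if is_camera and cap_queue.full():
        return False  # drop the frame to keep latency low
    try:
        cap_queue.put(frame, timeout=2)
        return True
    except Full:
        return False
```

The display side then pops the oldest frame from this queue and overlays the matching detection results on it before showing it.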
4. CV Initializations#
We will import the needed libraries, initialize the CV pipeline, and define common variables in this step.
import time
import argparse
import numpy as np
import cv2
from queue import Queue, Full
from threading import Thread
from matplotlib import pyplot as plt
from memryx.mxapi import MxAccl
# Stream-related containers
self.streams = []
self.streams_idx = [True] * self.num_streams
self.stream_window = [False] * self.num_streams
self.cap_queue = {i: Queue(maxsize=4) for i in range(self.num_streams)}
self.dets_queue = {i: Queue(maxsize=5) for i in range(self.num_streams)}
self.outputs = {i: [] for i in range(self.num_streams)}
# Initialize video captures, models, and dimensions for each stream
for i, video_path in enumerate(video_paths):
if "/dev/video" in video_path:
self.srcs_are_cams[i] = True
else:
self.srcs_are_cams[i] = False
vidcap = cv2.VideoCapture(video_path)
self.streams.append(vidcap)
self.dims[i] = (int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH)),
int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
self.color_wheel[i] = np.random.randint(0, 255, (20, 3)).astype(np.int32)
# Initialize the model with the stream dimensions
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))
#include <opencv2/opencv.hpp> /* imshow */
#include <opencv2/imgproc.hpp> /* cvtcolor */
#include <opencv2/imgcodecs.hpp> /* imwrite */
//Application Variables
std::deque<cv::Mat> frames_queue;
std::mutex frame_queue_mutex;
std::chrono::milliseconds start_ms;
5. Model Pre-/Post-Processing#
The pre-/post-processing steps are typically provided by the model authors and are outside the scope of this tutorial. A helper class that implements the pre- and post-processing of YOLO26 is included in the tutorial's compressed folder for reference. You can use the helper class as follows:
from yolov26 import YoloV26 as YoloModel
# Initialize the model with the stream dimensions
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))
The YoloModel.preprocess function takes the input frame and scales it to [0,1], as required by YOLO:
def preprocess(self, img):
h0, w0 = img.shape[:2]
self.orig_shape = (h0, w0)
r = self.input_size[0] / max(h0, w0)
if r != 1:
interp = cv2.INTER_AREA if r < 1 else cv2.INTER_LINEAR
img = cv2.resize(img, (int(w0 * r), int(h0 * r)), interpolation=interp)
img, _, dwdh = self._letterbox(img, new_shape=self.input_size, auto=False)
img = img.astype(np.float32) / 255.0
# Always store these because postprocess needs them
self.ratio = r
self.pad = dwdh
# add Z singleton dimension as expected by MxAccl: HWC -> HWZC
img = np.expand_dims(img, axis=2)
return img
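To make the letterbox math above concrete, here is a small standalone sketch. It assumes the symmetric padding and rounding used by standard YOLO letterboxing; `letterbox_params` is a hypothetical helper, not part of the tutorial package:

```python
def letterbox_params(h0, w0, input_size=640):
    """Compute the resize ratio and (dw, dh) padding for letterboxing
    an h0 x w0 frame into an input_size x input_size model input."""
    r = input_size / max(h0, w0)            # shrink so the long side fits
    new_w, new_h = round(w0 * r), round(h0 * r)
    dw = (input_size - new_w) / 2           # horizontal padding per side
    dh = (input_size - new_h) / 2           # vertical padding per side
    return r, (dw, dh)

# A 1080p frame maps to 640x360 inside the 640x640 input,
# with 140 px of vertical padding on each side.
r, (dw, dh) = letterbox_params(1080, 1920)
```

The ratio `r` and the padding `(dw, dh)` are exactly the values `preprocess` stores in `self.ratio` and `self.pad` for later use by `postprocess`.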
By passing the cropped model section to accl.connect_post_model() later, the output from the chip will automatically be run through the cropped section before being returned to the output callback function.
Then, the YoloModel.postprocess function will translate the raw output to bounding boxes, with coordinates adjusted to the original input image resolution:
def postprocess(self, fmap):
post_output = fmap[0]
if len(post_output) == 0:
return []
detections = post_output[0]
dets = []
h0, w0 = self.orig_shape
for detection in detections:
x1, y1, x2, y2, confidence, class_id = detection
if confidence < 0.4:
continue
# Remove padding, then scale back to original image
x1 = (x1 - self.pad[0]) / self.ratio
x2 = (x2 - self.pad[0]) / self.ratio
y1 = (y1 - self.pad[1]) / self.ratio
y2 = (y2 - self.pad[1]) / self.ratio
# Clip to image bounds
x1 = max(0, min(x1, w0 - 1))
x2 = max(0, min(x2, w0 - 1))
y1 = max(0, min(y1, h0 - 1))
y2 = max(0, min(y2, h0 - 1))
if x2 <= x1 or y2 <= y1:
continue
dets.append({
'bbox': (int(x1), int(y1), int(x2), int(y2)),
'class': COCO_CLASSES[int(class_id)],
'class_idx': int(class_id),
'score': float(confidence)
})
return dets
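The inverse mapping above (pad removal, then division by the ratio) can be checked in isolation. Below is a sketch with made-up numbers; `unletterbox_box` is a hypothetical helper, not tutorial code:

```python
def unletterbox_box(box, ratio, pad):
    """Map a box from model-input space back to original-image space."""
    x1, y1, x2, y2 = box
    dw, dh = pad
    return ((x1 - dw) / ratio, (y1 - dh) / ratio,
            (x2 - dw) / ratio, (y2 - dh) / ratio)

# With ratio 1/3 and 140 px of vertical padding (a 1080p source),
# a model-space box (100, 240, 200, 340) maps back to
# (300, 300, 600, 600) in the original frame (up to float rounding).
box = unletterbox_box((100, 240, 200, 340), 1/3, (0, 140))
```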
cv::Mat preprocess(cv::Mat& image) {
cv::Mat resizedImage;
cv::resize(image, resizedImage, cv::Size(model_input_width, model_input_height));
// Convert image to float32 and normalize
cv::Mat floatImage;
resizedImage.convertTo(floatImage, CV_32F, 1.0 / 255.0);
return floatImage;
}
void draw_bounding_box(cv::Mat& image, std::vector<detectedObj>& detections_vector) {
for (int i = 0; i < detections_vector.size(); ++i) {
detectedObj detected_object = detections_vector[i];
cv::rectangle(image, cv::Point(detected_object.x1, detected_object.y1), cv::Point(detected_object.x2, detected_object.y2), cv::Scalar(0, 255, 0), 2);
cv::putText(image, class_names.at(detected_object.obj_id),
cv::Point(detected_object.x1, detected_object.y1 - 3), cv::FONT_ITALIC,
0.8, cv::Scalar(255, 255, 255), 2);
cv::putText(image, std::to_string(detected_object.accuracy),
cv::Point(detected_object.x1, detected_object.y1 + 30), cv::FONT_ITALIC,
0.8, cv::Scalar(255, 255, 0), 2);
}
}
std::vector<detectedObj> get_detections(float* output, int num_boxes) {
std::vector<detectedObj> detections;
// YOLOv26 output format: (1, 300, 6)
// [x1, y1, x2, y2, confidence, class_id]
for (int i = 0; i < num_boxes; i++) {
float x1 = output[i * 6 + 0];
float y1 = output[i * 6 + 1];
float x2 = output[i * 6 + 2];
float y2 = output[i * 6 + 3];
float accuracy = output[i * 6 + 4];
int classPrediction = static_cast<int>(output[i * 6 + 5]);
if (accuracy < conf_thresh) {
continue;
}
// Skip clearly invalid boxes
if (x2 <= x1 || y2 <= y1) {
continue;
}
// Scale from model input space to original image space
x1 = (x1 / model_input_width) * input_image_width;
x2 = (x2 / model_input_width) * input_image_width;
y1 = (y1 / model_input_height) * input_image_height;
y2 = (y2 / model_input_height) * input_image_height;
// Reject boxes completely outside the image
if (x2 <= 0 || y2 <= 0 || x1 >= input_image_width || y1 >= input_image_height) {
continue;
}
// Clip boxes to image boundaries
x1 = std::max(0.0f, std::min(x1, static_cast<float>(input_image_width - 1)));
y1 = std::max(0.0f, std::min(y1, static_cast<float>(input_image_height - 1)));
x2 = std::max(0.0f, std::min(x2, static_cast<float>(input_image_width - 1)));
y2 = std::max(0.0f, std::min(y2, static_cast<float>(input_image_height - 1)));
// Reject boxes that became invalid after clipping
if (x2 <= x1 || y2 <= y1) {
continue;
}
detectedObj obj(x1, x2, y1, y2, classPrediction, accuracy);
detections.push_back(obj);
}
return detections;
}
The call to accl->connect_post_model() will automatically retrieve the output from the chip, run it through the cropped post-processing section of the graph using the ONNX Runtime, and generate the final output.
// Connecting the post-processing model obtained from the autocrop of neural compiler to get the final output.
// The second parameter is required as the output shape of this particular post-processing model is variable
// and accl requires to know maximum possible size of the output. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});
In this case, we use the ONNX Runtime, since the cropped section is in the ONNX format. Users can also write their own post-processing if ONNX Runtime is not available on their system.
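To make the (300, 6) output layout concrete, here is a small numpy sketch with synthetic data (not the tutorial code) that filters rows by confidence, mirroring the threshold used above:

```python
import numpy as np

# Synthetic post-model output: 300 rows of [x1, y1, x2, y2, conf, class_id]
out = np.zeros((300, 6), dtype=np.float32)
out[0] = [10, 20, 110, 220, 0.9, 0]   # a confident box, kept
out[1] = [5, 5, 50, 50, 0.1, 16]      # below threshold, discarded

keep = out[out[:, 4] >= 0.4]          # confidence threshold, as in the tutorial
```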
6. Define an Input Function#
We need to define an input function for the accelerator to use. In this case, our input function will get a new frame from the camera (or video stream) and pre-process it.
def capture_and_preprocess(self, stream_id):
while True:
got_frame, frame = self.streams[stream_id].read()
if not got_frame or self.done:
self.streams_idx[stream_id] = False
return None
if self.srcs_are_cams[stream_id] and self.cap_queue[stream_id].full():
# drop frame
continue
else:
try:
# Put the frame in the cap_queue to be processed later
self.cap_queue[stream_id].put(frame, timeout=2)
# OpenCV is BGR, need to convert to RGB before feeding into the model
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# Pre-process the frame using the corresponding model
frame_rgb = self.model[stream_id].preprocess(frame_rgb)
return frame_rgb
except Full:
print('Dropped frame')
continue
bool incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {
if (runflag.load()) {
cv::Mat inframe;
cv::Mat rgbImage;
while (true) {
bool got_frame = vcap.read(inframe);
if (!got_frame) {
std::cout << "No frame \n\n\n";
vcap.release();
return false; // return false if frame retrieval fails/stream is done sending input
}
if (src_is_cam && (frames_queue.size() >= FRAME_QUEUE_MAX_LENGTH)) {
// drop the frame and try again if we've hit the limit
continue;
}
else {
cv::cvtColor(inframe, rgbImage, cv::COLOR_BGR2RGB);
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
frames_queue.push_back(rgbImage);
}
}
// Preprocess frame
cv::Mat preProcframe = preprocess(rgbImage);
// Set preprocessed input data to be sent to accelerator
dst[0]->set_data((float*)preProcframe.data);
return true;
}
}
else {
vcap.release();
return false;// Manually stopped the application so returning false to stop input.
}
}
Note
In the above code, the preprocess method performs the pre-processing step. It can be found in the full code file.
7. Define Output Functions#
We also need to define an output function for the accelerator to use. Our output function will post-process the accelerator output and display it on the screen.
The output function will collect the outputs and postprocess them:
def postprocess(self, mxa_output, stream_id):
dets = self.model[stream_id].postprocess(mxa_output)
# Push the detection results to the queue
self.dets_queue[stream_id].put(dets)
# Calculate the FPS
self.dt_array[stream_id][self.dt_index[stream_id]] = time.time() - self.frame_end_time[stream_id]
self.dt_index[stream_id] += 1
if self.dt_index[stream_id] % 15 == 0:
self.fps[stream_id] = 1 / np.average(self.dt_array[stream_id])
if self.dt_index[stream_id] >= 30:
self.dt_index[stream_id] = 0
self.frame_end_time[stream_id] = time.time()
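The rolling-average FPS bookkeeping above can be sketched standalone with synthetic frame deltas (hypothetical names, not the tutorial's state):

```python
import numpy as np

WINDOW = 30
dt_array = np.zeros(WINDOW)   # per-frame time deltas
dt_index = 0
fps = 0.0

# Simulate 30 frames that each took 1/30 s; the FPS estimate is
# refreshed every 15 frames from the average of the stored deltas,
# and the write index wraps once the window is full.
for _ in range(WINDOW):
    dt_array[dt_index] = 1 / 30.0
    dt_index += 1
    if dt_index % 15 == 0:
        fps = 1 / np.average(dt_array)
    if dt_index >= WINDOW:
        dt_index = 0
```

Once the window is filled with 1/30 s deltas, the estimate settles at about 30 FPS.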
Then the drawing thread will pop these results and overlay them on the final image:
def display(self):
while not self.done:
# Iterate over each stream to handle multiple displays
for stream_idx in range(self.num_streams):
# Check if the queues for frames and detections have data
if not self.cap_queue[stream_idx].empty() and not self.dets_queue[stream_idx].empty():
frame = self.cap_queue[stream_idx].get()
dets = self.dets_queue[stream_idx].get()
self.cap_queue[stream_idx].task_done()
self.dets_queue[stream_idx].task_done()
# Draw detection boxes on the frame
for d in dets:
l, t, r, b = d['bbox']
color = tuple(int(c) for c in self.color_wheel[stream_idx][d['class_idx'] % 20])
frame = cv2.rectangle(frame, (l, t), (r, b), color, 2)
frame = cv2.rectangle(frame, (l, t - 18), (r, t), color, -1)
frame = cv2.putText(frame, d['class'], (l + 2, t - 5),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
# Add FPS information to the frame
fps_text = f"{self.model[stream_idx].name} - {self.fps[stream_idx]:.1f} FPS" if self.fps[stream_idx] > 1 else self.model[stream_idx].name
frame = cv2.putText(frame, fps_text, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
# Show the frame in a unique window for each stream
if self.show:
window_name = f"Stream {stream_idx} - YOLOv26"
cv2.imshow(window_name, frame)
# Exit on key press (applies to all streams)
if cv2.waitKey(1) == ord('q'):
self.done = True
# When done, destroy all windows and release resources
cv2.destroyAllWindows()
for stream in self.streams:
stream.release()
Besides collecting and post-processing the MXA output, the output function will also overlay and display the output frame.
bool outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {
//Output from the post-processing model is a vector of size 1,
//so copy only the first feature map
if(mxa_output != nullptr) // for cleaner shutdowns
if(src[0] != nullptr) // for cleaner shutdowns
src[0]->get_data(mxa_output);
else
return false;
else
return false;
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
// pop from frame queue
displayImage = frames_queue.front();
frames_queue.pop_front();
}// releases in frame queue lock
//Get the detections from model output
std::vector<detectedObj> detected_objectVector = get_detections(mxa_output, num_boxes);
// draw bounding boxes
draw_bounding_box(displayImage, detected_objectVector);
// using mx QT util to update the display frame
gui_->screens[0]->SetDisplayFrame(streamLabel, &displayImage, fps_number);
//Calculate FPS once every AVG_FPS_CALC_FRAME_COUNT frames
frame_count++;
if (frame_count == 1)
{
start_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
}
else if (frame_count % AVG_FPS_CALC_FRAME_COUNT == 0)
{
std::chrono::milliseconds duration =
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start_ms;
fps_number = (float)AVG_FPS_CALC_FRAME_COUNT * 1000 / (float)(duration.count());
frame_count = 0;
}
return true;
}
8. Connect the Accelerator#
The run function creates the accelerator, starts it with the specified number of streams, and waits for it to finish.
def run(self):
print("dfp path = ", self.dfp_path)
# Optimize performance by keeping inputs in opencv's HWC order
# by setting use_model_shape to [False (inputs), True (outputs)].
# Also, use local_mode=True for faster performance.
accl = MxAccl(self.dfp_path, [0], [False,True], True)
accl.connect_post_model(self.postmodel_path)
print("YOLOv26 inference on MX3 started")
self.display_thread.start()
start_time = time.time()
# Connect the input and output functions and let the accl run
for i in range(self.num_streams):
accl.connect_stream(self.capture_and_preprocess, self.postprocess, stream_id=i)
accl.start()
accl.wait()
self.done = True
# Join the display thread
self.display_thread.join()
The main() function creates the accelerator and the YoloV26 objects, starts the accelerator, and waits for it to finish.
int main(int argc, char* argv[]) {
signal(SIGINT, signal_handler);
vector<string> video_src_list;
std::string video_str = "cam:0";
/* [iterate through the arguments and set variables here] */
// if video_paths arg isn't passed - use default video string.
if (video_src_list.size() == 0) {
video_src_list.push_back(video_str);
}
// Initialize the MemryX accelerator, with use_model_shape [false,true] to avoid needing NHWC/NCHW input transpose
MX::Runtime::MxAccl accl{ fs::path(model_path), {0}, {false,true} };
// Connecting the post-processing model obtained from the autocrop of neural compiler to get the final output.
// The second parameter is required as the output shape of this particular post-processing model is variable
// and accl requires to know maximum possible size of the output. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});
// Creating GuiView which is a memryx qt util for easy display
MxQt gui(argc, argv);
// Setting the layout of the display based on number of input streams. Full screen mode only when more than one stream
if (video_src_list.size() == 1)
gui.screens[0]->SetSquareLayout(1, false);
else
gui.screens[0]->SetSquareLayout(static_cast<int>(video_src_list.size()));
//Creating a YoloV26 object for each stream which also connects the corresponding stream to accl.
std::vector<YoloV26*>yolo_objs;
for (int i = 0; i < video_src_list.size(); ++i) {
YoloV26* obj = new YoloV26(&accl, video_src_list[i], &gui, i);
yolo_objs.push_back(obj);
}
//Run the accelerator and wait
accl.start();
gui.Run(); //This command waits for exit to be pressed in Qt window
accl.stop();
//Cleanup
for (int i = 0; i < video_src_list.size(); ++i) {
delete yolo_objs[i];
}
}
The YoloV26() constructor connects the input stream to the accelerator.
YoloV26(MX::Runtime::MxAccl* accl, std::string video_src, MxQt* gui, int index) {
//Assign the gui argument to the class-specific variable
gui_ = gui;
// If the input is a camera, try to use optimal settings
if (video_src.substr(0, 3) == "cam") {
int device = std::stoi(video_src.substr(4));
src_is_cam = true;
#ifdef __linux__
if (!openCamera(vcap, device, cv::CAP_V4L2)) {
throw(std::runtime_error("Failed to open: " + video_src));
}
#elif defined(_WIN32)
if (!openCamera(vcap, device, cv::CAP_ANY)) {
throw(std::runtime_error("Failed to open: " + video_src));
}
#endif
}
else if (video_src.substr(0, 3) == "vid") {
std::cout << "Video source given = " << video_src.substr(4) << "\n\n";
vcap.open(video_src.substr(4), cv::CAP_ANY);
src_is_cam = false;
}
else {
throw(std::runtime_error("Given video src: " + video_src + " is invalid" +
"\n\n\tUse ./objectDetection cam:<camera index>,vid:<path to video file>,cam:<camera index>,vid:<path to video file>\n\n"));
}
if (!vcap.isOpened()) {
std::cout << "videocapture for " << video_src << " is NOT opened, \n try giving full absolute paths for video files and the correct camera index for cameras \n";
runflag.store(false);
}
// Getting input image dimensions
input_image_width = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_WIDTH));
input_image_height = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_HEIGHT));
model_info = accl->get_model_info(0);//Getting model info of 0th model which is the only model in this DFP
mxa_output = new float[num_boxes * 6];//Creating the memory of output (max_boxes X num_box_parameters) - YOLOv26 has 6 params per box
//Getting model input shapes and display size
model_input_height = model_info.in_featuremap_shapes[0][0];
model_input_width = model_info.in_featuremap_shapes[0][1];
//Connect the stream to the accl object. Since the callback functions are defined as part of the
//YoloV26 class, we bind them with their input parameters
auto in_cb = std::bind(&YoloV26::incallback_getframe, this, std::placeholders::_1, std::placeholders::_2);
auto out_cb = std::bind(&YoloV26::outcallback_getmxaoutput, this, std::placeholders::_1, std::placeholders::_2);
accl->connect_stream(in_cb, out_cb, index/**Unique Stream Idx */, 0/**Model Idx */);
//Starts the callbacks when the call is started
runflag.store(true);
}
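The cam:/vid: source-string convention used by the constructor can be illustrated with a small Python sketch (`parse_source` is a hypothetical helper mirroring the C++ parsing above, not part of the tutorial package):

```python
def parse_source(src):
    """Parse 'cam:<index>' or 'vid:<path>' source strings."""
    kind, _, value = src.partition(":")
    if kind == "cam":
        return ("camera", int(value))   # camera device index
    if kind == "vid":
        return ("video", value)         # path to a video file
    raise ValueError(f"Invalid video source: {src}")
```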
Third-Party Licenses#
This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:
Model: YOLO26-nano from GitHub
License: AGPLv3
Summary#
This tutorial showed how to use the MxAccl API to run real-time multistream inference with an object-detection model.