MultiStream YOLOv8S Object Detection#
Introduction#
In this tutorial, we will show how to use the MultiStreamAcclerator Python API and MxAccl C++ API to perform multistream real-time object detection on MX3 to demostrate multi-stream capability. We will use the YOLOv8S model for our demo.
Note
This tutorial assumes a four-chip solution is correctly connected.
Download & Run
Download
This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.
Run
Requirements
Install the required Python packages:
pip install opencv-python==4.11.0.86
pip install pyyaml==6.0.2
pip install pandas==2.3.1
pip install ultralytics==8.3.161
Run Command
Execute the Python script:
python run_objectiondetection.py
Requirements
Ensure that all Memx runtime plugins and utility libraries are installed. For detailed installation instructions, refer to the Installation Guide.
Install the required packages:
sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui
Run
Build and execute the C++ application:
cd src/cpp/
mkdir build
cd build
cmake ..
make
./multistream_objectdetection_yolov8s
1. Download the Model#
The YOLOv8S pre-trained models are available on the Official YOLOv8S GitHub page. For the sake of the tutorial, we exported both TFLite and ONNX models for the user to download. The pre-compiled model can be found in the compressed folder multistream_od_yolov8.zip
to this tutorial.
2. Compile the Model#
The YOLOv8S model was exported with the option to include a post-processing section in the model graph. Hence, it needed to be compiled with the neural compiler autocrop
option.
After the compilation, the compiler will generate the DFP file for the main section of the model (yolov8s.dfp
) and
the cropped post-processing section of the model (yolov8s_post.tflite or yolov8s_post.onnx
).
The compilation step is typically needed once and can be done using the Neural Compiler API or Tool.
Hint
You can use the pre-compiled DFP and post-processing section multistream_od_yolov8.zip
to this tutorial and skip the compilation step.
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO_v8_small_640_640_3_tflite.tflite", verbose=1, autocrop=True)
dfp = nc.run()
In your command line, you need to type,
mx_nc -v -m YOLO_v8_small_640_640_3_tflite.tflite --autocrop -c 4
In your python code, you need to point the dfp
via a generated file path,
dfp = "YOLO_v8_small_640_640_3_tflite.dfp"
In your C++ code, you need to point the dfp
via a generated file path,
fs::path model_path = "YOLO_v8_small_640_640_3_tflite.dfp"
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO_v8_small_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()
In your command line, you need to type,
mx_nc -v -m YOLO_v8_small_640_640_3_onnx.onnx --autocrop -c 4
In your python code, you need to point the dfp
via a generated file path,
dfp = "YOLO_v8_small_640_640_3_onnx.dfp"
In your C++ code, you need to point the dfp
via a generated file path,
fs::path model_path = "YOLO_v8_small_640_640_3_onnx.dfp"
Note
YOLOv8s can be run using both TFLite and ONNX. Only the path needs to be changed in the script.
3. CV Pipelines#
In this tutorial, we will show end-to-end implementations for the CV graph. Here, the overlay and the output display will be part of the output function connected to the MultiStreamAsyncAccl API. The following flowchart shows the different parts of the pipeline. It should be noted that the input camera frame should be saved (queued) to be later overlayed and displayed.
4. CV Initializations#
We will import the needed libraries, initialize the CV pipeline, and define common variables in this step.
import time
import argparse
import numpy as np
import cv2
from queue import Queue, Full
from threading import Thread
from memryx import MultiStreamAsyncAccl
from yolov8 import YoloV8 as YoloModel
# Stream-related containers and initialization
self.streams = []
self.streams_idx = [True] * self.num_streams
self.stream_window = [False] * self.num_streams
self.cap_queue = {i: Queue(maxsize=4) for i in range(self.num_streams)}
self.dets_queue = {i: Queue(maxsize=5) for i in range(self.num_streams)}
self.outputs = {i: [] for i in range(self.num_streams)}
vidcap = cv2.VideoCapture(video_path)
self.streams.append(vidcap)
#include <opencv2/opencv.hpp> /* imshow */
#include <opencv2/imgproc.hpp> /* cvtcolor */
#include <opencv2/imgcodecs.hpp> /* imwrite */
// Application Variables
std::deque<cv::Mat> frames_queue; // Queue for frames
cv::VideoCapture vcap; // Video capture object
5. Model Pre-/Post-Processing#
The pre-/post-processing steps are typically provided by the model authors and are outside of the scope of this tutorial. We provided a helper class with the tutorial compressed folder that implements the pre- and post-processing of YOLOv8, and the user can check it for their reference. You can use the helper class as follows,
from yolov8 import YoloV8 as YoloModel
# Initialize video captures, models, and dimensions for each stream
for i, video_path in enumerate(video_paths):
vidcap = cv2.VideoCapture(video_path)
# Initialize the YOLOv8 model
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3), model_type=self.model_type)
The accl.set_postprocessing_model()
function will automatically retrieve the output from the chip, apply the cropped graph post-processing section using the TFLite/ONNX runtime if you pass the corresponding TFLite/ONNX model, and generate the final output.
accl.set_postprocessing_model(self.post_model, model_idx=0) # Set the post-processing model
In this case, we will use the TFLite runtime since the model passed is in TFLite format. Users can also write their own post-processing if TFLite/ONNX Runtime is not available on their system.
After that, output can then be sent to the post-processing code in the YOLOv8 helper class to get the detection on the output image. The class can be found as a part of the full code file.
// Function to preprocess the input image (resize and normalize)
cv::Mat preprocess(cv::Mat& image) {
// Get original image dimensions
int img_height = image.rows;
int img_width = image.cols;
// Determine the size of the square image (longest side)
length = std::max(img_height, img_width);
// Create a black square image with the same number of channels
cv::Mat squareImage = cv::Mat::zeros(cv::Size(length, length), image.type());
// Copy the original image into the square image
image.copyTo(squareImage(cv::Rect(0, 0, img_width, img_height)));
// Resize to 640x640 for the model input
cv::Mat resizedImage;
if (model_type == "tflite") {
cv::resize(squareImage, resizedImage, cv::Size(640, 640), cv::INTER_LINEAR);
} else {
cv::dnn::blobFromImage(squareImage, resizedImage, 1.0, cv::Size(640, 640), cv::Scalar(0, 0, 0), true, false);
}
// Convert to float32 and normalize pixel values (0-1 range)
resizedImage.convertTo(resizedImage, CV_32F, 1.0 / 255.0);
// Convert HWC (Height, Width, Channels) to CHW (Channels, Height, Width)
std::vector<cv::Mat> channels(3);
cv::split(resizedImage, channels);
cv::Mat chwImage;
cv::merge(channels, chwImage);
return chwImage;
}
// Function to process model output and get bounding boxes
std::vector<Box> get_detections(float* ofmap, int num_boxes, const cv::Mat& inframe) {
std::vector<Box> all_boxes;
std::vector<cv::Rect> cv_boxes;
std::vector<float> all_scores;
std::vector<Box> filtered_boxes;
// Precompute scaling factors once
const float y_factor = static_cast<float>(length) / static_cast<float>(model_input_height);
const float x_factor = static_cast<float>(length) / static_cast<float>(model_input_width);
// Iterate through the model outputs
for (int i = 0; i < num_boxes; ++i) {
// Extract and scale the bounding box coordinates
float x0 = ofmap[i];
float y0 = ofmap[num_boxes + i];
float w = ofmap[2 * num_boxes + i];
float h = ofmap[3 * num_boxes + i];
x0 *= x_factor;
y0 *= y_factor;
w *= x_factor;
h *= y_factor;
int x1 = static_cast<int>(x0 - 0.5f * w);
int y1 = static_cast<int>(y0 - 0.5f * h);
int x2 = static_cast<int>(x0 + 0.5f * w);
int y2 = static_cast<int>(y0 + 0.5f * h);
// Loop through classes and apply confidence threshold
for (int j = 4; j < features_per_box; ++j) {
float confidence = ofmap[j * num_boxes + i];
if (confidence > conf_thresh) {
// Add detected box to the list
Box box;
box.x1 = x1;
box.y1 = y1;
box.x2 = x2;
box.y2 = y2;
box.class_id = j - 4;
box.confidence = confidence;
all_boxes.push_back(box);
all_scores.push_back(confidence);
cv_boxes.emplace_back(cv::Rect(x1, y1, x2 - x1, y2 - y1));
}
}
}
// Apply Non-Maximum Suppression (NMS) to remove overlapping boxes
std::vector<int> nms_result;
if (!cv_boxes.empty()) {
cv::dnn::NMSBoxes(cv_boxes, all_scores, conf_thresh, nms_thresh, nms_result);
for (int idx : nms_result) {
filtered_boxes.push_back(all_boxes[idx]);
}
}
return filtered_boxes;
}
// Function to draw bounding boxes on the image
void draw_bounding_boxes(cv::Mat& image, const std::vector<Box>& boxes) {
for (const Box& box : boxes) {
// Draw bounding box
cv::rectangle(image, cv::Point(box.x1, box.y1), cv::Point(box.x2, box.y2), cv::Scalar(0, 255, 0), 2);
// Add class label and confidence score
std::string label = class_names[box.class_id];
int baseLine = 0;
cv::Size labelSize = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
cv::putText(image, label, cv::Point(box.x1, box.y1 - labelSize.height), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(255, 255, 255), 1);
}
}
The accl->connect_post_model();
function will automatically retrieve the output from the chip, apply the cropped graph post-processing section using the TFLite/ONNX runtime if you pass the corresponding TFLite/ONNX model, and generate the final output.
}
In this case, we will use the TFLite runtime since the model passed is in TFLite format. Users can also write their own post-processing if TFLite/ONNX Runtime is not available on their system.
6. Define an Input Function#
We need to define an input function for the accelerator to use. In this case, our input function will get a new frame from the cam and pre-process it.
def capture_and_preprocess(self, stream_idx):
"""
Captures a frame for the video device and pre-processes it.
"""
# if self.srcs_are_cams[stream_idx]:
while True:
got_frame, frame = self.streams[stream_idx].read()
if not got_frame or self.done:
self.streams_idx[stream_idx] = False
return None
if self.srcs_are_cams[stream_idx] and self.cap_queue[stream_idx].full():
# drop frame
continue
else:
try:
# Put the frame in the cap_queue to be processed later
self.cap_queue[stream_idx].put(frame, timeout=2)
# Pre-process the frame using the corresponding model
frame = self.model[stream_idx].preprocess(frame)
return frame
except Full:
print('Dropped frame')
continue
Note
In the above code, method preprocess is used as a pre-processing step. This method can be found as a part of the full code file.
bool incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {
if (runflag.load()) {
cv::Mat inframe;
cv::Mat rgbImage;
while(true){
bool got_frame = vcap.read(inframe); // Capture frame
if (!got_frame) { // If no frame, stop the stream
std::cout << "No frame \n\n\n";
vcap.release();
return false;
}
if(src_is_cam && (frames_queue.size() >= FRAME_QUEUE_MAX_LENGTH)){
// drop the frame and try again if we've hit the limit
continue;
}
else{
// Convert frame to RGB and store in queue
cv::cvtColor(inframe, rgbImage, cv::COLOR_BGR2RGB);
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
frames_queue.push_back(rgbImage);
}
}
// Preprocess frame and set data for inference
cv::Mat preProcframe = preprocess(rgbImage);
dst[0]->set_data((float*)preProcframe.data);
return true;
}
}
else {
vcap.release();
return false;
}
}
Note
In the above code, method preprocess is used as a pre-processing step. This method can be found as a part of the full code file.
7. Define Output Functions#
We also need to define an out function for the accelerator to use. Our output function will post-process the accelerator output and display it on the screen.
The output function will also overlay and display the output frame besides the MXA data collection and post-processing.
def postprocess(self, stream_idx, *mxa_output):
"""
Post-process the output from MXA.
"""
dets = self.model[stream_idx].postprocess(mxa_output) # Get detection results
# Queue detection results for display
self.dets_queue[stream_idx].put(dets)
# Calculate FPS
self.dt_array[stream_idx][self.dt_index[stream_idx]] = time.time() - self.frame_end_time[stream_idx]
self.dt_index[stream_idx] += 1
if self.dt_index[stream_idx] % 15 == 0:
self.fps[stream_idx] = 1 / np.average(self.dt_array[stream_idx])
if self.dt_index[stream_idx] >= 30:
self.dt_index[stream_idx] = 0
self.frame_end_time[stream_idx] = time.time()
Note
In the above code, method postprocess is used as post-processing step. This method can be found as a part of the full code file.
The output function will also overlay and display the output frame besides the MXA data collection and post-processing.
// Output callback function to process MXA output and display results
bool outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {
src[0]->get_data(mxa_output); // Get the output data from MXA
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
displayImage = frames_queue.front();
frames_queue.pop_front();
}
// Get detected boxes and draw them on the image
std::vector<Box> detected_objectVector = get_detections(mxa_output, num_boxes, displayImage);
draw_bounding_boxes(displayImage, detected_objectVector);
// Display the updated image in the GUI
gui_->screens[0]->SetDisplayFrame(streamLabel, &displayImage, fps_number);
// Calculate FPS
frame_count++;
if (frame_count == 1) {
start_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
}
else if (frame_count % AVG_FPS_CALC_FRAME_COUNT == 0) {
std::chrono::milliseconds duration = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start_ms;
fps_number = (float)AVG_FPS_CALC_FRAME_COUNT * 1000 / (float)(duration.count());
frame_count = 0;
}
return true;
}
8. Connect the Accelerator#
The run function creates the accelerator, starts it with the specified number of streams, and waits for it to finish.
def run(self):
"""
Start inference on the MXA using multiple streams.
"""
accl = MultiStreamAsyncAccl(dfp=self.dfp) # Initialize the accelerator with DFP
print("YOLOv8s inference on MX3 started")
accl.set_postprocessing_model(self.post_model, model_idx=0) # Set the post-processing model
self.display_thread.start() # Start the display thread
start_time = time.time()
# Connect input and output streams for the accelerator
accl.connect_streams(self.capture_and_preprocess, self.postprocess, self.num_streams)
accl.wait()
self.done = True
# Join display thread
self.display_thread.join()
The main() function Creates the accelerator, YoloV8 object and starts the acceleartor and waits for it to finish
// Create the Accl object and load the DFP model
MX::Runtime::MxAccl accl{fs::path(model_path)};
// Connect the post-processing model
accl.connect_post_model(fs::path(postprocessing_model_path));
// Creating YoloV8 objects for each video stream
std::vector<YoloV8*> yolo_objs;
for (int i = 0; i < video_src_list.size(); ++i) {
YoloV8* obj = new YoloV8(&accl, video_src_list[i], model_type, &gui, i);
yolo_objs.push_back(obj);
}
// Run the accelerator and wait
accl.start();
gui.Run(); // Wait until the exit button is pressed in the Qt window
accl.stop();
The YoloV8() constructor connects the input stream to the accelerator.
// Constructor to initialize YOLOv8 object
YoloV8(MX::Runtime::MxAccl* accl, std::string video_src, std::string model_type, MxQt* gui, int index) {
// Get model info and allocate output buffer
model_info = accl->get_model_info(0);
mxa_output = new float[num_boxes * features_per_box]; // Allocate memory for model output
// Get model input dimensions
model_input_height = model_info.in_featuremap_shapes[0][0];
model_input_width = model_info.in_featuremap_shapes[0][1];
// Bind input/output callback functions
auto in_cb = std::bind(&YoloV8::incallback_getframe, this, std::placeholders::_1, std::placeholders::_2);
auto out_cb = std::bind(&YoloV8::outcallback_getmxaoutput, this, std::placeholders::_1, std::placeholders::_2);
// Connect streams to the accelerator
accl->connect_stream(in_cb, out_cb, index /**Unique Stream Idx */, 0 /**Model Idx */);
Third-Party License#
This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:
Model: Yolov8 from Ultralytics GitHub
License: AGPL
Code and Pre/Post-Processing: Some code components, including pre/post-processing, were sourced from their GitHub
License: AGPL
Summary#
This tutorial showed how to use a MultiStreamAccelerator API to run a multistream real-time inference using an object-detection model.
See also