MultiStream YOLO26 Object Detection#
Introduction#
In this tutorial, we will show how to use the MxAccl Python API and C++ API to perform real-time object detection on multiple input streams. We will use the YOLO26-nano model for our demo.
Hint
This tutorial showcases SDK 2.2’s new Python API, which brings large performance improvements and a unified interface with C++.
See the YOLOv8S tutorial for a similar walkthrough with the old Python API.
Download and Run
Download
This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.
Run
Requirements
Install the latest compatible versions of the required packages:
pip install --extra-index-url https://developer.memryx.com/pip memryx ultralytics 'opencv-python~=4.11.0'
Run
python run_yolo26n_multistream_objectdetection.py
Requirements
sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui
Run
cd src/c++/
mkdir build
cd build
cmake ..
make
./multistream_objectdetection
1. Download the Model#
The YOLO26 pre-trained models are available on the official YOLO26 GitHub page. For this tutorial, we have already exported the model for you. It can be found in the compressed folder multistream_od_yolo26.zip attached to this tutorial.
Steps are for explanation and learning
These step-by-step snippets are provided to explain the process and help you understand the concepts. For a complete, runnable version, please use the full scripts from the “Download & Run” section above.
2. Compile the Model#
The YOLO26-nano model was exported with the option to include a post-processing section in the model graph. Hence, it needs to be compiled with the Neural Compiler autocrop option. After compilation, the compiler generates the DFP file for the main section of the model (YOLO26_nano_640_640_3_onnx.dfp) and an ONNX file for the cropped post-processing section (YOLO26_nano_640_640_3_onnx_post.onnx). The compilation step is typically needed only once and can be done using the Neural Compiler API or Tool.
Hint
You can use the pre-compiled DFP and post-processing section from multistream_od_yolo26.zip attached to this tutorial and skip the compilation step.
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO26_nano_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()
On the command line, run:
mx_nc -v -m YOLO26_nano_640_640_3_onnx.onnx --autocrop -c 4
In your Python code, point to the generated DFP file path:
dfp = "YOLO26_nano_640_640_3_onnx.dfp"
In your C++ code, point to the generated DFP file path:
fs::path model_path = "YOLO26_nano_640_640_3_onnx.dfp";
3. CV Pipelines#
In this tutorial, we show end-to-end implementations of the CV graph. Here, the overlay and output display are part of the output function connected to the MultiStreamAsyncAccl API. The following flowchart shows the different parts of the pipeline. Note that each input camera frame must be saved (queued) so that it can later be overlaid and displayed.
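As a minimal illustration of this queued-frame pattern, the input side can push each captured frame into a bounded queue that the display side later drains. This is a standalone sketch (not the tutorial package code); `enqueue_frame` is a hypothetical helper name:

```python
from queue import Queue, Full

# Bounded queue for one stream: holds raw frames until the display
# thread pairs them with detection results.
cap_queue = Queue(maxsize=4)

def enqueue_frame(frame, is_camera):
    """Queue a frame; for live cameras, drop it instead of blocking."""
    if is_camera and cap_queue.full():
        return False  # drop the frame to keep latency low
    try:
        cap_queue.put(frame, timeout=2)
        return True
    except Full:
        return False
```

The display side then pops the oldest frame from this queue and overlays the matching detection results on it before showing it.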
4. CV Initializations#
We will import the needed libraries, initialize the CV pipeline, and define common variables in this step.
import time
import argparse
import numpy as np
import cv2
from queue import Queue, Full
from threading import Thread
from matplotlib import pyplot as plt
from memryx.mxapi import MxAccl
# Stream-related containers
self.streams = []
self.streams_idx = [True] * self.num_streams
self.stream_window = [False] * self.num_streams
self.cap_queue = {i: Queue(maxsize=4) for i in range(self.num_streams)}
self.dets_queue = {i: Queue(maxsize=5) for i in range(self.num_streams)}
self.outputs = {i: [] for i in range(self.num_streams)}
# Initialize video captures, models, and dimensions for each stream
for i, video_path in enumerate(video_paths):
if "/dev/video" in video_path:
self.srcs_are_cams[i] = True
else:
self.srcs_are_cams[i] = False
vidcap = cv2.VideoCapture(video_path)
self.streams.append(vidcap)
self.dims[i] = (int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH)),
int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
self.color_wheel[i] = np.random.randint(0, 255, (20, 3)).astype(np.int32)
# Initialize the model with the stream dimensions
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))
#include <opencv2/opencv.hpp> /* imshow */
#include <opencv2/imgproc.hpp> /* cvtcolor */
#include <opencv2/imgcodecs.hpp> /* imwrite */
//Application Variables
std::deque<cv::Mat> frames_queue;
std::mutex frame_queue_mutex;
std::chrono::milliseconds start_ms;
5. Model Pre-/Post-Processing#
The pre-/post-processing steps are typically provided by the model authors and are outside the scope of this tutorial. A helper class that implements the pre- and post-processing of YOLO26 is included in the tutorial's compressed folder for reference. You can use the helper class as follows:
from yolov26 import YoloV26 as YoloModel
# Initialize the model with the stream dimensions
self.model[i] = YoloModel(stream_img_size=(self.dims[i][1], self.dims[i][0], 3))
The YoloModel.preprocess function takes the input frame and scales it to [0,1], as required by YOLO:
def preprocess(self, img):
h0, w0 = img.shape[:2]
self.orig_shape = (h0, w0)
r = self.input_size[0] / max(h0, w0)
if r != 1:
interp = cv2.INTER_AREA if r < 1 else cv2.INTER_LINEAR
img = cv2.resize(img, (int(w0 * r), int(h0 * r)), interpolation=interp)
img, _, dwdh = self._letterbox(img, new_shape=self.input_size, auto=False)
img = img.astype(np.float32) / 255.0
# Always store these because postprocess needs them
self.ratio = r
self.pad = dwdh
# add Z singleton dimension as expected by MxAccl: HWC -> HWZC
img = np.expand_dims(img, axis=2)
return img
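To make the letterbox math above concrete, here is a small standalone sketch. It assumes the symmetric padding and rounding used by standard YOLO letterboxing; `letterbox_params` is a hypothetical helper, not part of the tutorial package:

```python
def letterbox_params(h0, w0, input_size=640):
    """Compute the resize ratio and (dw, dh) padding for letterboxing
    an h0 x w0 frame into an input_size x input_size model input."""
    r = input_size / max(h0, w0)            # shrink so the long side fits
    new_w, new_h = round(w0 * r), round(h0 * r)
    dw = (input_size - new_w) / 2           # horizontal padding per side
    dh = (input_size - new_h) / 2           # vertical padding per side
    return r, (dw, dh)

# A 1080p frame maps to 640x360 inside the 640x640 input,
# with 140 px of vertical padding on each side.
r, (dw, dh) = letterbox_params(1080, 1920)
```

The ratio `r` and the padding `(dw, dh)` are exactly the values `preprocess` stores in `self.ratio` and `self.pad` for later use by `postprocess`.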
By passing the cropped model section to accl.connect_post_model() later, the output from the chip will automatically be run through the cropped section before being returned to the output callback function.
Then, the YoloModel.postprocess function will translate the raw output to bounding boxes, with coordinates adjusted to the original input image resolution:
def postprocess(self, fmap):
post_output = fmap[0]
if len(post_output) == 0:
return []
detections = post_output[0]
dets = []
h0, w0 = self.orig_shape
for detection in detections:
x1, y1, x2, y2, confidence, class_id = detection
if confidence < 0.4:
continue
# Remove padding, then scale back to original image
x1 = (x1 - self.pad[0]) / self.ratio
x2 = (x2 - self.pad[0]) / self.ratio
y1 = (y1 - self.pad[1]) / self.ratio
y2 = (y2 - self.pad[1]) / self.ratio
# Clip to image bounds
x1 = max(0, min(x1, w0 - 1))
x2 = max(0, min(x2, w0 - 1))
y1 = max(0, min(y1, h0 - 1))
y2 = max(0, min(y2, h0 - 1))
if x2 <= x1 or y2 <= y1:
continue
dets.append({
'bbox': (int(x1), int(y1), int(x2), int(y2)),
'class': COCO_CLASSES[int(class_id)],
'class_idx': int(class_id),
'score': float(confidence)
})
return dets
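The inverse mapping above (pad removal, then division by the ratio) can be checked in isolation. Below is a sketch with made-up numbers; `unletterbox_box` is a hypothetical helper, not tutorial code:

```python
def unletterbox_box(box, ratio, pad):
    """Map a box from model-input space back to original-image space."""
    x1, y1, x2, y2 = box
    dw, dh = pad
    return ((x1 - dw) / ratio, (y1 - dh) / ratio,
            (x2 - dw) / ratio, (y2 - dh) / ratio)

# With ratio 1/3 and 140 px of vertical padding (a 1080p source),
# a model-space box (100, 240, 200, 340) maps back to
# (300, 300, 600, 600) in the original frame (up to float rounding).
box = unletterbox_box((100, 240, 200, 340), 1/3, (0, 140))
```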
cv::Mat preprocess(cv::Mat& image) {
cv::Mat resizedImage;
cv::resize(image, resizedImage, cv::Size(model_input_width, model_input_height));
// Convert image to float32 and normalize
cv::Mat floatImage;
resizedImage.convertTo(floatImage, CV_32F, 1.0 / 255.0);
return floatImage;
}
void draw_bounding_box(cv::Mat& image, std::vector<detectedObj>& detections_vector) {
for (int i = 0; i < detections_vector.size(); ++i) {
detectedObj detected_object = detections_vector[i];
cv::rectangle(image, cv::Point(detected_object.x1, detected_object.y1), cv::Point(detected_object.x2, detected_object.y2), cv::Scalar(0, 255, 0), 2);
cv::putText(image, class_names.at(detected_object.obj_id),
cv::Point(detected_object.x1, detected_object.y1 - 3), cv::FONT_ITALIC,
0.8, cv::Scalar(255, 255, 255), 2);
cv::putText(image, std::to_string(detected_object.accuracy),
cv::Point(detected_object.x1, detected_object.y1 + 30), cv::FONT_ITALIC,
0.8, cv::Scalar(255, 255, 0), 2);
}
}
std::vector<detectedObj> get_detections(float* output, int num_boxes) {
std::vector<detectedObj> detections;
// YOLOv26 output format: (1, 300, 6)
// [x1, y1, x2, y2, confidence, class_id]
for (int i = 0; i < num_boxes; i++) {
float x1 = output[i * 6 + 0];
float y1 = output[i * 6 + 1];
float x2 = output[i * 6 + 2];
float y2 = output[i * 6 + 3];
float accuracy = output[i * 6 + 4];
int classPrediction = static_cast<int>(output[i * 6 + 5]);
if (accuracy < conf_thresh) {
continue;
}
// Skip clearly invalid boxes
if (x2 <= x1 || y2 <= y1) {
continue;
}
// Scale from model input space to original image space
x1 = (x1 / model_input_width) * input_image_width;
x2 = (x2 / model_input_width) * input_image_width;
y1 = (y1 / model_input_height) * input_image_height;
y2 = (y2 / model_input_height) * input_image_height;
// Reject boxes completely outside the image
if (x2 <= 0 || y2 <= 0 || x1 >= input_image_width || y1 >= input_image_height) {
continue;
}
// Clip boxes to image boundaries
x1 = std::max(0.0f, std::min(x1, static_cast<float>(input_image_width - 1)));
y1 = std::max(0.0f, std::min(y1, static_cast<float>(input_image_height - 1)));
x2 = std::max(0.0f, std::min(x2, static_cast<float>(input_image_width - 1)));
y2 = std::max(0.0f, std::min(y2, static_cast<float>(input_image_height - 1)));
// Reject boxes that became invalid after clipping
if (x2 <= x1 || y2 <= y1) {
continue;
}
detectedObj obj(x1, x2, y1, y2, classPrediction, accuracy);
detections.push_back(obj);
}
return detections;
}
The call to accl->connect_post_model() will automatically retrieve the output from the chip, run it through the cropped post-processing section of the graph using the ONNX Runtime, and generate the final output.
// Connecting the post-processing model obtained from the autocrop of neural compiler to get the final output.
// The second parameter is required as the output shape of this particular post-processing model is variable
// and accl requires to know maximum possible size of the output. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});
In this case, we use the ONNX Runtime, since the cropped section is in the ONNX format. Users can also write their own post-processing if ONNX Runtime is not available on their system.
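To make the (300, 6) output layout concrete, here is a small numpy sketch with synthetic data (not the tutorial code) that filters rows by confidence, mirroring the threshold used above:

```python
import numpy as np

# Synthetic post-model output: 300 rows of [x1, y1, x2, y2, conf, class_id]
out = np.zeros((300, 6), dtype=np.float32)
out[0] = [10, 20, 110, 220, 0.9, 0]   # a confident box, kept
out[1] = [5, 5, 50, 50, 0.1, 16]      # below threshold, discarded

keep = out[out[:, 4] >= 0.4]          # confidence threshold, as in the tutorial
```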
6. Define an Input Function#
We need to define an input function for the accelerator to use. In this case, our input function will get a new frame from the camera (or video stream) and pre-process it.
def capture_and_preprocess(self, stream_id):
while True:
got_frame, frame = self.streams[stream_id].read()
if not got_frame or self.done:
self.streams_idx[stream_id] = False
return None
if self.srcs_are_cams[stream_id] and self.cap_queue[stream_id].full():
# drop frame
continue
else:
try:
# Put the frame in the cap_queue to be processed later
self.cap_queue[stream_id].put(frame, timeout=2)
# OpenCV is BGR, need to convert to RGB before feeding into the model
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# Pre-process the frame using the corresponding model
frame_rgb = self.model[stream_id].preprocess(frame_rgb)
return frame_rgb
except Full:
print('Dropped frame')
continue
bool incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {
if (runflag.load()) {
cv::Mat inframe;
cv::Mat rgbImage;
while (true) {
bool got_frame = vcap.read(inframe);
if (!got_frame) {
std::cout << "No frame \n\n\n";
vcap.release();
return false; // return false if frame retrieval fails/stream is done sending input
}
if (src_is_cam && (frames_queue.size() >= FRAME_QUEUE_MAX_LENGTH)) {
// drop the frame and try again if we've hit the limit
continue;
}
else {
cv::cvtColor(inframe, rgbImage, cv::COLOR_BGR2RGB);
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
frames_queue.push_back(rgbImage);
}
}
// Preprocess frame
cv::Mat preProcframe = preprocess(rgbImage);
// Set preprocessed input data to be sent to accelerator
dst[0]->set_data((float*)preProcframe.data);
return true;
}
}
else {
vcap.release();
return false;// Manually stopped the application so returning false to stop input.
}
}
Note
In the above code, the preprocess method performs the pre-processing step. It can be found in the full code file.
7. Define Output Functions#
We also need to define an output function for the accelerator to use. Our output function will post-process the accelerator output and display it on the screen.
The output function will collect the outputs and postprocess them:
def postprocess(self, mxa_output, stream_id):
dets = self.model[stream_id].postprocess(mxa_output)
# Push the detection results to the queue
self.dets_queue[stream_id].put(dets)
# Calculate the FPS
self.dt_array[stream_id][self.dt_index[stream_id]] = time.time() - self.frame_end_time[stream_id]
self.dt_index[stream_id] += 1
if self.dt_index[stream_id] % 15 == 0:
self.fps[stream_id] = 1 / np.average(self.dt_array[stream_id])
if self.dt_index[stream_id] >= 30:
self.dt_index[stream_id] = 0
self.frame_end_time[stream_id] = time.time()
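The rolling-average FPS bookkeeping above can be sketched standalone with synthetic frame deltas (hypothetical names, not the tutorial's state):

```python
import numpy as np

WINDOW = 30
dt_array = np.zeros(WINDOW)   # per-frame time deltas
dt_index = 0
fps = 0.0

# Simulate 30 frames that each took 1/30 s; the FPS estimate is
# refreshed every 15 frames from the average of the stored deltas,
# and the write index wraps once the window is full.
for _ in range(WINDOW):
    dt_array[dt_index] = 1 / 30.0
    dt_index += 1
    if dt_index % 15 == 0:
        fps = 1 / np.average(dt_array)
    if dt_index >= WINDOW:
        dt_index = 0
```

Once the window is filled with 1/30 s deltas, the estimate settles at about 30 FPS.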
Then the drawing thread will pop these results and overlay them on the final image:
def display(self):
while not self.done:
# Iterate over each stream to handle multiple displays
for stream_idx in range(self.num_streams):
# Check if the queues for frames and detections have data
if not self.cap_queue[stream_idx].empty() and not self.dets_queue[stream_idx].empty():
frame = self.cap_queue[stream_idx].get()
dets = self.dets_queue[stream_idx].get()
self.cap_queue[stream_idx].task_done()
self.dets_queue[stream_idx].task_done()
# Draw detection boxes on the frame
for d in dets:
l, t, r, b = d['bbox']
color = tuple(int(c) for c in self.color_wheel[stream_idx][d['class_idx'] % 20])
frame = cv2.rectangle(frame, (l, t), (r, b), color, 2)
frame = cv2.rectangle(frame, (l, t - 18), (r, t), color, -1)
frame = cv2.putText(frame, d['class'], (l + 2, t - 5),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
# Add FPS information to the frame
fps_text = f"{self.model[stream_idx].name} - {self.fps[stream_idx]:.1f} FPS" if self.fps[stream_idx] > 1 else self.model[stream_idx].name
frame = cv2.putText(frame, fps_text, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
# Show the frame in a unique window for each stream
if self.show:
window_name = f"Stream {stream_idx} - YOLOv26"
cv2.imshow(window_name, frame)
# Exit on key press (applies to all streams)
if cv2.waitKey(1) == ord('q'):
self.done = True
# When done, destroy all windows and release resources
cv2.destroyAllWindows()
for stream in self.streams:
stream.release()
Besides collecting and post-processing the MXA output, the output function will also overlay and display the output frame.
bool outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {
//Output from the post-processing model is a vector of size 1,
//so copy only the first feature map
if(mxa_output != nullptr) // for cleaner shutdowns
if(src[0] != nullptr) // for cleaner shutdowns
src[0]->get_data(mxa_output);
else
return false;
else
return false;
{
std::lock_guard<std::mutex> ilock(frame_queue_mutex);
// pop from frame queue
displayImage = frames_queue.front();
frames_queue.pop_front();
}// releases in frame queue lock
//Get the detections from model output
std::vector<detectedObj> detected_objectVector = get_detections(mxa_output, num_boxes);
// draw bounding boxes
draw_bounding_box(displayImage, detected_objectVector);
// using mx QT util to update the display frame
gui_->screens[0]->SetDisplayFrame(streamLabel, &displayImage, fps_number);
//Calculate FPS once every AVG_FPS_CALC_FRAME_COUNT frames
frame_count++;
if (frame_count == 1)
{
start_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
}
else if (frame_count % AVG_FPS_CALC_FRAME_COUNT == 0)
{
std::chrono::milliseconds duration =
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start_ms;
fps_number = (float)AVG_FPS_CALC_FRAME_COUNT * 1000 / (float)(duration.count());
frame_count = 0;
}
return true;
}
8. Connect the Accelerator#
The run function creates the accelerator, starts it with the specified number of streams, and waits for it to finish.
def run(self):
print("dfp path = ", self.dfp_path)
# Optimize performance by keeping inputs in opencv's HWC order
# by setting use_model_shape to [False (inputs), True (outputs)].
# Also, use local_mode=True for faster performance.
accl = MxAccl(self.dfp_path, [0], [False,True], True)
accl.connect_post_model(self.postmodel_path)
print("YOLOv26 inference on MX3 started")
self.display_thread.start()
start_time = time.time()
# Connect the input and output functions and let the accl run
for i in range(self.num_streams):
accl.connect_stream(self.capture_and_preprocess, self.postprocess, stream_id=i)
accl.start()
accl.wait()
self.done = True
# Join the display thread
self.display_thread.join()
The main() function creates the accelerator and the YoloV26 objects, starts the accelerator, and waits for it to finish.
int main(int argc, char* argv[]) {
signal(SIGINT, signal_handler);
vector<string> video_src_list;
std::string video_str = "cam:0";
/* [iterate through the arguments and set variables here] */
// if video_paths arg isn't passed - use default video string.
if (video_src_list.size() == 0) {
video_src_list.push_back(video_str);
}
// Initialize the MemryX accelerator, with use_model_shape [false,true] to avoid needing NHWC/NCHW input transpose
MX::Runtime::MxAccl accl{ fs::path(model_path), {0}, {false,true} };
// Connecting the post-processing model obtained from the autocrop of neural compiler to get the final output.
// The second parameter is required as the output shape of this particular post-processing model is variable
// and accl requires to know maximum possible size of the output. In this case it is (max_possible_boxes * size_of_box = 300 * 6 = 1800).
accl.connect_post_model(fs::path(postprocessing_model_path), 0, std::vector<size_t>{300 * 6});
// Creating GuiView which is a memryx qt util for easy display
MxQt gui(argc, argv);
// Setting the layout of the display based on number of input streams. Full screen mode only when more than one stream
if (video_src_list.size() == 1)
gui.screens[0]->SetSquareLayout(1, false);
else
gui.screens[0]->SetSquareLayout(static_cast<int>(video_src_list.size()));
//Creating a YoloV26 object for each stream which also connects the corresponding stream to accl.
std::vector<YoloV26*>yolo_objs;
for (int i = 0; i < video_src_list.size(); ++i) {
YoloV26* obj = new YoloV26(&accl, video_src_list[i], &gui, i);
yolo_objs.push_back(obj);
}
//Run the accelerator and wait
accl.start();
gui.Run(); //This command waits for exit to be pressed in Qt window
accl.stop();
//Cleanup
for (int i = 0; i < video_src_list.size(); ++i) {
delete yolo_objs[i];
}
}
The YoloV26() constructor connects the input stream to the accelerator.
YoloV26(MX::Runtime::MxAccl* accl, std::string video_src, MxQt* gui, int index) {
//Assign the gui argument to the class-specific variable
gui_ = gui;
// If the input is a camera, try to use optimal settings
if (video_src.substr(0, 3) == "cam") {
int device = std::stoi(video_src.substr(4));
src_is_cam = true;
#ifdef __linux__
if (!openCamera(vcap, device, cv::CAP_V4L2)) {
throw(std::runtime_error("Failed to open: " + video_src));
}
#elif defined(_WIN32)
if (!openCamera(vcap, device, cv::CAP_ANY)) {
throw(std::runtime_error("Failed to open: " + video_src));
}
#endif
}
else if (video_src.substr(0, 3) == "vid") {
std::cout << "Video source given = " << video_src.substr(4) << "\n\n";
vcap.open(video_src.substr(4), cv::CAP_ANY);
src_is_cam = false;
}
else {
throw(std::runtime_error("Given video src: " + video_src + " is invalid" +
"\n\n\tUse ./objectDetection cam:<camera index>,vid:<path to video file>,cam:<camera index>,vid:<path to video file>\n\n"));
}
if (!vcap.isOpened()) {
std::cout << "videocapture for " << video_src << " is NOT opened, \n try giving full absolute paths for video files and the correct camera index for cameras \n";
runflag.store(false);
}
// Getting input image dimensions
input_image_width = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_WIDTH));
input_image_height = static_cast<int>(vcap.get(cv::CAP_PROP_FRAME_HEIGHT));
model_info = accl->get_model_info(0);//Getting model info of 0th model which is the only model in this DFP
mxa_output = new float[num_boxes * 6];//Creating the memory of output (max_boxes X num_box_parameters) - YOLOv26 has 6 params per box
//Getting model input shapes and display size
model_input_height = model_info.in_featuremap_shapes[0][0];
model_input_width = model_info.in_featuremap_shapes[0][1];
//Connect the stream to the accl object. Since the callback functions are defined as part of the
//YoloV26 class, we bind them with their input parameters
auto in_cb = std::bind(&YoloV26::incallback_getframe, this, std::placeholders::_1, std::placeholders::_2);
auto out_cb = std::bind(&YoloV26::outcallback_getmxaoutput, this, std::placeholders::_1, std::placeholders::_2);
accl->connect_stream(in_cb, out_cb, index/**Unique Stream Idx */, 0/**Model Idx */);
//Starts the callbacks when the call is started
runflag.store(true);
}
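The cam:/vid: source-string convention used by the constructor can be illustrated with a small Python sketch (`parse_source` is a hypothetical helper mirroring the C++ parsing above, not part of the tutorial package):

```python
def parse_source(src):
    """Parse 'cam:<index>' or 'vid:<path>' source strings."""
    kind, _, value = src.partition(":")
    if kind == "cam":
        return ("camera", int(value))   # camera device index
    if kind == "vid":
        return ("video", value)         # path to a video file
    raise ValueError(f"Invalid video source: {src}")
```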
Third-Party Licenses#
This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:
Model: YOLO26-nano from GitHub
License: AGPLv3
Summary#
This tutorial showed how to use the MxAccl API to run real-time multistream inference with an object-detection model.