YOLOv7t Object Detection#

Introduction#

In this tutorial, we will show how to use the AsyncAccl Python API to perform real-time object detection on MX3. We will use the YOLOv7-tiny model for our demo.

Note

This tutorial assumes a four-chip solution is correctly connected.

Download the Model#

The YOLOv7 pre-trained models are available on the official YOLOv7 GitHub page. For convenience, we have already exported the model; it can be found in the compressed folder attached to this tutorial.

Compile the Model#

The YOLOv7-tiny model was exported with the option to include a post-processing section in the model graph. Hence, it needs to be compiled with the Neural Compiler autocrop option. The compiler will generate the DFP file for the main section of the model (yolov7-tiny_416.dfp) and the cropped post-processing section of the model (yolov7-tiny_416.post.onnx). The compilation step is typically needed only once and can be done using the Neural Compiler API or Tool.

Hint

You can use the pre-compiled DFP and post-processing section attached to this tutorial and skip the compilation step.

Using the Neural Compiler Python API, the compilation can be done in your Python code,

from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="yolov7-tiny_416.onnx", verbose=1, dfp_fname="yolov7-tiny_416", autocrop=True)
dfp = nc.run()

Alternatively, using the Neural Compiler Tool, you need to type the following in your command line,

mx_nc -v -m yolov7-tiny_416.onnx --autocrop -c 4

This will produce a DFP file ready to be used by the accelerator. In your Python code, you need to point the dfp variable to the generated file path,

dfp = "yolov7-tiny_416.dfp"

In your C++ code, you need to point to the generated DFP file path, for example,

//YoloV7 application specific parameters
const std::string dfp_path = "yolov7-tiny_416.dfp"; // hypothetical variable name; use it wherever your application loads the DFP

CV Pipelines#

In this tutorial, we will show two different end-to-end implementations of the CV graph. The first one is the Sequential Display option, in which the overlay and the output display are part of the output function connected to the AsyncAccl API. This implementation is simple, but having the overlay as part of the output function might limit the inference performance on some systems. The following flowchart shows the different parts of the pipeline. Note that the input camera frame needs to be saved (queued) so it can later be overlaid and displayed.

graph LR
    input([Input Function]) --> accl[Accelerator]
    accl --> output([Output Function])
    input -.-> q[[Frames Queue]]
    q -.-> output
    style input fill:#CFE8FD, stroke:#595959
    style accl fill:#FFE699, stroke:#595959
    style output fill:#A9D18E, stroke:#595959
    style q fill:#dbd9d3, stroke:#595959

The second option is the Threaded Display, where the overlay/display runs in its own parallel thread. In this case, the display thread collects the frames and the detections from their queues and displays them independently, as shown in the following flowchart; a minimal code sketch of this option follows the flowchart.

graph LR
    subgraph Inference
        input([Input Function]) --> accl[Accelerator]
        accl --> output([Output Function])
    end
    input -.-> fq[[Frames Queue]]
    output -.-> dq[[Dets Queue]]
    subgraph Display
        display([Display Thread])
    end
    fq -.-> display
    dq -.-> display
    style input fill:#CFE8FD, stroke:#595959
    style accl fill:#FFE699, stroke:#595959
    style output fill:#A9D18E, stroke:#595959
    style fq fill:#dbd9d3, stroke:#595959
    style dq fill:#dbd9d3, stroke:#595959
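
The attached code in the rest of this tutorial implements the Sequential Display option. As an illustration only, the following is a minimal sketch of how the Threaded Display option could be wired up. It assumes the queues, helper model, and drawing logic defined later in this tutorial, and the function and variable names used here are placeholders.

# Minimal sketch of the Threaded Display option (illustrative only; names are placeholders).
# It assumes cap_queue, dets_queue, and model are defined as in the sections below.

def output_to_queue(*mxa_output):
    # Output function for this option: only queue the detections, no drawing here
    dets_queue.put(model.postprocess(mxa_output))

def display_loop(stop_event):
    # Display thread: pair each queued frame with its detections and show the overlay
    while not stop_event.is_set():
        try:
            frame = cap_queue.get(timeout=1)
        except Empty:
            continue
        dets = dets_queue.get()  # detections arrive in the same order as the frames
        # ... draw the detection boxes on 'frame' exactly as in the sequential version ...
        cv2.imshow('YOLOv7-Tiny on MemryX MXA', frame)
        if cv2.waitKey(1) == ord('q'):
            stop_event.set()

stop_event = Event()
Thread(target=display_loop, args=(stop_event,), daemon=True).start()
# accl.connect_output(output_to_queue) would then be used instead of the
# sequential postprocess_and_display function shown later in this tutorial.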

CV Initializations#

Import the needed libraries, initialize the CV pipeline, and define common variables in this step.

# OpenCV and helper libraries imports
import numpy as np
import cv2
from queue import Queue, Empty
from threading import Thread, Event
import sys

# CV and Queues
num_frames = 0
cap_queue = Queue(maxsize=10)
dets_queue = Queue(maxsize=10)
src = sys.argv[1] if len(sys.argv) > 1 else '/dev/video0'
vidcap = cv2.VideoCapture(src) 
dims = ( int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH)), 
        int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT)) )
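
Optionally (this check is not part of the attached code), you can fail early if the camera or video file could not be opened, instead of letting later vidcap.read() calls silently return empty frames:

# Optional sanity check: stop early if the video source could not be opened
if not vidcap.isOpened():
    sys.exit(f"Could not open video source: {src}")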

Model Pre-/Post-Processing#

The pre-/post-processing steps are typically provided by the model authors and are outside the scope of this tutorial. We provide a helper class in the tutorial's compressed folder that implements the pre- and post-processing for YOLOv7, which you can check for reference. You can use the helper class as follows,

from yolov7 import YoloV7Tiny
model = YoloV7Tiny(stream_img_size=(dims[1], dims[0], 3))
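
For reference, the helper exposes the preprocess and postprocess methods used by the input and output functions later in this tutorial. As a quick illustration (a hypothetical standalone example, not part of the pipeline code):

# Illustrative, standalone use of the helper on a single frame (hypothetical example)
ok, frame = vidcap.read()
if ok:
    inp = model.preprocess(frame)  # frame converted into the model's expected input format
    print('Pre-processed input shape:', inp.shape)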

The accl.set_postprocessing_model call will automatically retrieve the output from the chip, apply the cropped post-processing section of the graph using the ONNX Runtime, and generate the final output.

accl.set_postprocessing_model('yolov7-tiny_416.post.onnx', model_idx=0)

This output can then be sent to the post-processing code in the YOLOv7 helper class to get the detections on the output image.

Define an Input Function#

We need to define an input function for the accelerator to use. In this case, our input function will get a new frame from the camera and pre-process it.

def capture_and_preprocess():

    # Get a frame from the cam
    got_frame, frame = vidcap.read()

    if not got_frame:
        return None

    # Put the frame in the cap_queue to be overlayed later
    cap_queue.put(frame)

    # Pre-process the frame
    frame = model.preprocess(frame)
    return frame

Define Output Functions#

We also need to define an output function for the accelerator to use. Our output function will post-process the accelerator output and display it on the screen.

In addition to collecting and post-processing the MXA output, the output function will also overlay and display the output frame.

def postprocess_and_display(*mxa_output):

    # Post-process the MXA output
    dets = model.postprocess(mxa_output)

    # Get the frame from the frames queue
    frame = cap_queue.get()

    # Draw the OD boxes
    for d in dets:
        l,t,r,b = d['bbox']
        frame = cv2.rectangle(frame, (l,t), (r,b), (255,0,0), 2) 
        frame = cv2.rectangle(frame, (l,t-18), (r,t), (255,0,0), -1) 
        frame = cv2.putText(frame, d['class'], (l+2,t-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 2)

    # Show the frame
    cv2.imshow('YOLOv7-Tiny on MemryX MXA', frame)

    # Exit on a key press
    if cv2.waitKey(1) == ord('q'):
        cv2.destroyAllWindows()
        vidcap.release()
        exit(1)

Connect the Accelerator#

Now we need to connect the input and output functions to the AsyncAccl API. The API will take care of the rest.

# AsyncAccl
from memryx import AsyncAccl
accl = AsyncAccl(dfp=dfp)

# Gets the output from the chip and performs the cropped graph post-processing
accl.set_postprocessing_model('yolov7-tiny_416.post.onnx', model_idx=0)

# Connect the input and output functions and let the accl run
accl.connect_input(capture_and_preprocess)
accl.connect_output(postprocess_and_display)
accl.wait()

The accelerator will automatically call the connected input and output functions in a fully pipelined fashion.
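
Once the input function returns None (for example, when the video source ends), the pipeline drains and accl.wait() returns, at which point the program can release its resources. A small optional addition (not part of the attached code):

# Optional cleanup once accl.wait() returns (e.g., after the video source is exhausted)
vidcap.release()
cv2.destroyAllWindows()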

Third-Party Licenses#

This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:

Summary#

This tutorial showed how to use the AsyncAccl Python API to run real-time inference with an object-detection model. The code and the resources used in the tutorial are available to download: