Cartoonizer and Pose Estimation#
Introduction#
The Multi-DFP example demonstrates how two distinct DFPs (Dataflow Programs)—Cartoonizer and Pose Estimation—can be combined and run as a single application on MemryX accelerators using the Accelerator API. It uses a pre-trained YOLOv8s-pose model and an open-source cartoonizer model for real-time inference.
Note
This tutorial assumes a four-chip solution is correctly connected.
This example demonstrates how to run two applications together efficiently. For standalone usage, refer to their individual examples: Cartoonizer and Pose Estimation.
Download & Run
Download
This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.
Run
Requirements
Install the required dependencies:
pip install opencv-python==4.11.0.86
pip install PyQt5==5.15.11
pip install ultralytics==8.3.161
Run Command
Run the Python script:
cd src/python/
python run.py --cam
# Run it with a video:
python run.py --video <video_path>
Requirements
Install the required dependencies and build the project:
# Install memx runtime plugins and utilities libs
sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui
# Install OpenCV (ensure version compatibility)
sudo apt-get install libopencv-dev
# Build the C++ project
cd src/cpp/
mkdir build && cd build
cmake ..
make
Run Command
Run the C++ executable:
./cartoon_pose_app --cam
# Run it with a video:
./cartoon_pose_app --video <video_path>
1. Download the Model#
Pre-trained models for Cartoonizer and YOLOv8s-pose can be found on their respective GitHub pages: Cartoonizer model and Official YOLOv8s-pose.
For the sake of this tutorial, we exported and pre-compiled the models for you to download. They can be found in the compressed folder attached to this tutorial.
2. Compile the Model#
Note
You can use the pre-compiled DFP attached to this tutorial and skip the compilation step. Please make sure to include it in your working folder.
The compilation step is typically needed once and can be done using the Neural Compiler API or Tool.
The YOLOv8s-pose model was exported with the option to include a post-processing section in the model graph.
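If you want to reproduce that export yourself, a minimal sketch using the ultralytics package (already listed in the Python requirements above) is shown below. The exact export options used for the attached model are an assumption.

# Hypothetical export sketch -- the attached ONNX may have been produced with
# different options. Requires the ultralytics package from the requirements.
from ultralytics import YOLO

model = YOLO("yolov8s-pose.pt")         # downloads the official weights if not present
model.export(format="onnx", imgsz=640)  # writes an ONNX graph with the post-processing head included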
To compile the model, run the following in your command line:
mx_nc -v -m YOLO_v8_small_pose_640_640_3_onnx.onnx --autocrop -c 4
This will produce a DFP file ready to be used by the accelerator, along with a cropped post-processing model in ONNX format. In your Python code, point the dfp and post_model variables to the generated file paths.
dfp = "YOLO_v8_small_pose_640_640_3_onnx.dfp"
post_model = "YOLO_v8_small_pose_640_640_3_onnx_post.onnx"
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO_v8_small_pose_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()
Cartoonizer Model
To get the Cartoonizer ONNX model, please follow these steps. After obtaining the ONNX model, you can compile it as shown below.
To compile the model, run the following in your command line:
mx_nc -v -m facial-cartoonizer_512.onnx --autocrop -c 4
This will produce a DFP file ready to be used by the accelerator. In your Python code, point the cartoon_dfp variable to the generated file path.
cartoon_dfp = "facial-cartoonizer_512.dfp"
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="facial-cartoonizer_512.onnx", verbose=1, autocrop=True)
cartoon_dfp = nc.run()
3. CV Initializations#
In this step, we import the needed libraries and initialize the CV pipeline.
import os
import sys
from pathlib import Path
import cv2 as cv
import numpy as np
from collections import deque
import argparse
from multiprocessing import Process, Queue, Event
from memryx import AsyncAccl
from apps import Cartoonizer, PoseEstmiation
from PyQt5.QtWidgets import QApplication
from displayer import Displayer
from memryx.runtime import SchedulerOptions, ClientOptions
This function reads frames from a single camera source and places them into two separate queues, allowing the same frames to be passed concurrently to both the Cartoonizer and Pose Estimation accelerators.
def shared_capture_loop(src, queue1, queue2, stop_flag):
    cap = cv.VideoCapture(src)
    # Check if the source is a camera or video file
    is_cam = isinstance(src, int) or (isinstance(src, str) and src.startswith('/dev/video'))
    while not stop_flag.is_set():
        ret, frame = cap.read()
        if not ret:
            break
        for q in [queue1, queue2]:
            if not q.full():
                q.put(frame.copy())
    cap.release()
    if not is_cam:
        stop_flag.set()
#include <iostream>
#include <thread>
#include <atomic>
#include <mutex>
#include <csignal>
#include <queue>
#include <condition_variable>
#include <opencv2/opencv.hpp>
#include "memx/accl/MxAccl.h"
#include <memx/mxutils/gui_view.h>
#include "PoseApp.h"
#include "CartoonApp.h"
// Frame queues and sync
std::queue<cv::Mat> cartoon_queue;
std::queue<cv::Mat> pose_queue;
std::mutex cartoon_mutex, pose_mutex;
std::condition_variable cartoon_cv, pose_cv;
// Runtime state
std::atomic_bool runflag(true);
cv::VideoCapture vcap;
// Signal handler to safely stop
void signalHandler(int signum) {
    runflag.store(false);
}
const size_t MAX_QUEUE_SIZE = 20; // You can adjust this as needed
This function reads frames from a single camera source and places them into two separate queues, allowing the same frames to be passed concurrently to both the Cartoonizer and Pose Estimation accelerators. If the input is a video file, the application waits until the queues have been fully processed by the accelerators before exiting.
void frameProducer(cv::VideoCapture& cap, std::atomic_bool& runflag, bool is_video) {
    cv::Mat frame;
    while (runflag.load()) {
        if (!cap.read(frame)) {
            std::cerr << "[Producer] End of video or failed to read frame.\n";
            break;
        }
        {
            std::lock_guard<std::mutex> lock(cartoon_mutex);
            cartoon_queue.push(frame.clone());
            cartoon_cv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lock(pose_mutex);
            pose_queue.push(frame.clone());
            pose_cv.notify_one();
        }
    }
    if (is_video) {
        std::cout << "[Producer] Waiting for queues to flush before exiting...\n";
        while (runflag.load()) {
            std::unique_lock<std::mutex> lock1(cartoon_mutex, std::defer_lock);
            std::unique_lock<std::mutex> lock2(pose_mutex, std::defer_lock);
            std::lock(lock1, lock2);
            if (cartoon_queue.empty() && pose_queue.empty())
                break;
            lock1.unlock();
            lock2.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        std::cout << "[Producer] Flush wait done or interrupted. Proceeding to shutdown...\n";
    }
    runflag.store(false);
    cartoon_cv.notify_all();
    pose_cv.notify_all();
}
4. Define an Input Function#
We need to define an input function for each accelerator (Cartoonizer and Pose). In this case, the input function retrieves a new frame from the queue, pre-processes it, and returns the image to the accelerator.
Cartoonizer Input Callback
def get_frame(self):
    while True:
        try:
            frame = self.frame_queue.get(timeout=5)
        except queue.Empty:
            return None
        if self.src_is_cam and self.capture_queue.full():
            continue
        if self.input_height is None or self.input_width is None:
            self.input_height = int(frame.shape[0] * self.scale)
            self.input_width = int(frame.shape[1] * self.scale)
        if self.mirror:
            frame = cv.flip(frame, 1)
        self.capture_queue.put(frame)
        return self.preprocess(frame)
Pose-Estimation Input Callback
def generate_frame(self):
    while True:
        try:
            frame = self.frame_queue.get(timeout=5)
        except queue.Empty:
            return None
        if self.src_is_cam and self.capture_queue.full():
            continue
        if self.input_height is None or self.input_width is None:
            self.input_height = int(frame.shape[0])
            self.input_width = int(frame.shape[1])
        if self.mirror:
            frame = cv.flip(frame, 1)
        self.capture_queue.put(frame)
        out, self.ratio = self.preprocess_image(frame)
        return out
Cartoonizer Input Callback
bool CartoonApp::incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel)
{
    if (!runflag->load()) return false;
    cv::Mat inframe;
    {
        std::unique_lock<std::mutex> lock(*queue_mutex);
        queue_cv->wait(lock, [&]() { return !frame_queue->empty() || !runflag->load(); });
        if (!runflag->load()) return false;
        inframe = frame_queue->front().clone();
        frame_queue->pop();
    }
    cv::resize(inframe, img_resized, img_resized.size());
    cv::cvtColor(img_resized, img_resized, cv::COLOR_BGR2RGB);
    img_resized.convertTo(img_model_in, CV_32FC3, 1.0 / 127.5, -1.0);
    dst[0]->set_data((float*)img_model_in.data);
    return true;
}
Pose-Estimation Input Callback
bool PoseApp::incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {
    if (!runflag->load()) return false;
    cv::Mat inframe;
    {
        std::unique_lock<std::mutex> lock(*queue_mutex);
        queue_cv->wait(lock, [&]() { return !frame_queue->empty() || !runflag->load(); });
        if (!runflag->load()) return false;
        inframe = frame_queue->front().clone();
        frame_queue->pop();
    }
    {
        std::lock_guard<std::mutex> qlock(frameQueue_mutex);
        frames_queue.push_back(inframe.clone());
    }
    cv::Mat rgb, resized, floatImg;
    cv::cvtColor(inframe, rgb, cv::COLOR_BGR2RGB);
    cv::dnn::blobFromImage(rgb, resized, 1.0, cv::Size(model_input_width, model_input_height), cv::Scalar(0, 0, 0), true, false);
    resized.convertTo(floatImg, CV_32F, 1.0 / 255.0);
    dst[0]->set_data((float*)floatImg.data);
    return true;
}
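The preprocess_image helper called in the Python pose callback is defined in apps.py and is not reproduced here. As a rough illustration only, a typical YOLOv8-style preprocessing routine that scales the frame to the model input size and returns the scale ratio used later to map detections back could look like this sketch (the shipped implementation may differ):

import cv2 as cv
import numpy as np

def preprocess_image(frame, input_width=640, input_height=640):
    # Illustrative sketch only -- not the implementation shipped in apps.py.
    # Scale the frame so it fits the model input, pad the remainder with zeros,
    # and return the scale ratio needed to map boxes/keypoints back later.
    ratio = min(input_width / frame.shape[1], input_height / frame.shape[0])
    resized = cv.resize(frame, (int(frame.shape[1] * ratio), int(frame.shape[0] * ratio)))
    padded = np.zeros((input_height, input_width, 3), dtype=np.uint8)
    padded[:resized.shape[0], :resized.shape[1]] = resized
    rgb = cv.cvtColor(padded, cv.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0, ratio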
5. Define an Output Function#
We also need to define an output function for each accelerator. The output callback function will post-process the accelerator’s output and display it on the screen.
Cartoonizer Output Callback
def process_model_output(self, *ofmaps):
    img = self.postprocess(ofmaps[0], (self.input_width, self.input_height))
    self.display(img)
    self.capture_queue.get()
    return img
Pose-Estimation Output Callback
def process_model_output(self, *ofmaps):
    predict = ofmaps[0].squeeze(0).T
    predict = predict[predict[:, 4] > self.box_score]
    scores = predict[:, 4]
    boxes = self.xywh2xyxy(predict[:, 0:4] / self.ratio)
    kpts = predict[:, 5:]
    for i in range(kpts.shape[0]):
        for j in range(kpts.shape[1] // 3):
            if kpts[i, 3*j+2] < self.kpt_score:
                kpts[i, 3*j:3*(j+1)] = [-1, -1, -1]
            else:
                kpts[i, 3*j] /= self.ratio
                kpts[i, 3*j+1] /= self.ratio
    idxes = self.nms_process(boxes, scores, self.nms_thr)
    img = self.capture_queue.get()
    self.capture_queue.task_done()
    for kpt in np.array(kpts)[idxes]:
        for pair in self.KEYPOINT_PAIRS:
            pt1 = kpt[3 * pair[0]: 3 * (pair[0] + 1)]
            pt2 = kpt[3 * pair[1]: 3 * (pair[1] + 1)]
            if pt1[2] > 0 and pt2[2] > 0:
                cv.line(img, (int(pt1[0]), int(pt1[1])), (int(pt2[0]), int(pt2[1])), (255, 255, 255), 3)
        for idx in range(len(kpt) // 3):
            x, y, score = kpt[3*idx:3*(idx+1)]
            if score > 0:
                cv.circle(img, (int(x), int(y)), 5, self.COLOR_LIST[idx % len(self.COLOR_LIST)], -1)
    self.show(img)
Cartoonizer Output Callback
bool CartoonApp::outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel)
{
    src[0]->get_data((float*)img_model_out.data);
    // Convert [-1,1] → [0,255] safely
    cv::add(img_model_out, cv::Scalar(1.0f, 1.0f, 1.0f), img_model_out);
    cv::min(cv::max(img_model_out, 0.0f), 2.0f, img_model_out);
    img_model_out.convertTo(img_model_out_uint, CV_8UC3, 127.5);
    // Resize for display and show
    cv::resize(img_model_out_uint, img_final_out_resized, displaySize);
    gui_->screens[0]->SetDisplayFrame(gui_stream_idx_, &img_final_out_resized);
    return true;
}
Pose-Estimation Output Callback
bool PoseApp::outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {
    for (int i = 0; i < post_model_info.num_out_featuremaps; ++i)
        src[i]->get_data(ofmap[i]);
    cv::Mat inframe;
    {
        std::lock_guard<std::mutex> qlock(frameQueue_mutex);
        inframe = frames_queue.front().clone();
        frames_queue.pop_front();
    }
    float x_factor = inframe.cols / float(model_input_width);
    float y_factor = inframe.rows / float(model_input_height);
    std::vector<Box> boxes;
    std::vector<cv::Rect> rects;
    std::vector<float> scores;
    for (int i = 0; i < dets_length; ++i) {
        float conf = ofmap[0][4 * dets_length + i];
        if (conf < box_score) continue;
        float x = ofmap[0][0 * dets_length + i] * x_factor;
        float y = ofmap[0][1 * dets_length + i] * y_factor;
        float w = ofmap[0][2 * dets_length + i] * x_factor;
        float h = ofmap[0][3 * dets_length + i] * y_factor;
        int x1 = int(x - w / 2);
        int y1 = int(y - h / 2);
        int x2 = int(x + w / 2);
        int y2 = int(y + h / 2);
        Box box;
        box.confidence = conf;
        for (int j = 0; j < num_kpts; ++j) {
            float kx = ofmap[0][(5 + 3 * j + 0) * dets_length + i] * x_factor;
            float ky = ofmap[0][(5 + 3 * j + 1) * dets_length + i] * y_factor;
            float kc = ofmap[0][(5 + 3 * j + 2) * dets_length + i];
            box.keypoints.emplace_back(kc > kpt_score ? std::make_pair(kx, ky) : std::make_pair(-1.f, -1.f));
        }
        boxes.push_back(box);
        rects.emplace_back(x1, y1, x2 - x1, y2 - y1);
        scores.push_back(conf);
    }
    std::vector<int> keep;
    cv::dnn::NMSBoxes(rects, scores, box_score, nms_thr, keep);
    for (int idx : keep) {
        const auto& box = boxes[idx];
        for (const auto& pair : KEYPOINT_PAIRS) {
            auto p1 = box.keypoints[pair.first];
            auto p2 = box.keypoints[pair.second];
            if (p1.first != -1 && p2.first != -1)
                cv::line(inframe, {int(p1.first), int(p1.second)}, {int(p2.first), int(p2.second)}, {255,255,255}, 2);
        }
        for (int i = 0; i < box.keypoints.size(); ++i) {
            auto& kpt = box.keypoints[i];
            if (kpt.first != -1)
                cv::circle(inframe, {int(kpt.first), int(kpt.second)}, 3, COLOR_LIST[i % COLOR_LIST.size()], -1);
        }
    }
    cv::Mat bgrFrame;
    cv::cvtColor(inframe, bgrFrame, cv::COLOR_RGB2BGR);
    gui_->screens[0]->SetDisplayFrame(gui_stream_idx_, &bgrFrame);
    return true;
}
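The nms_process helper used in the Python pose callback also lives in apps.py; the C++ version above delegates the same job to cv::dnn::NMSBoxes. As an illustration (not necessarily the shipped implementation), a pure-NumPy greedy non-maximum suppression over xyxy boxes could look like:

import numpy as np

def nms_process(boxes, scores, nms_thr):
    # Illustrative greedy NMS over (N, 4) xyxy boxes; returns the kept indices.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box above the threshold
        order = order[1:][iou < nms_thr]
    return keep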
6. Connect the Accelerator#
We create separate accl objects for each application and connect the corresponding input and output functions to the AsyncAccl API. The API handles the rest of the processing pipeline.
Each application—Cartoonizer and Pose Estimation—has its own setup function that creates and configures an AsyncAccl instance with two optional configurations:
SchedulerOptions: Controls how many frames are processed, the sizes of input/output queues, and whether the accelerator should stop if the input queue becomes empty.
ClientOptions: Enables FPS smoothing and sets a target input frame rate (e.g., 30 FPS), helping maintain consistent processing speed.
These options help manage performance and ensure that both accelerators run smoothly when used together. For more information, see the Accelerator API documentation, particularly the Runtime Scheduler Configuration section.
Each application has its own AsyncAccl instance:
def run_cartoonizer(queue, dfp_path, display_thread, src_is_cam, frame_limit, stop_flag):
    sched_opts = SchedulerOptions(
        frame_limit,  # frame_limit: Max frames before DFP swap
        0,            # time_limit (ms): No time-based swap limit
        False,        # stop_on_empty: Keep DFP active even if input is empty
        24,           # ifmap_queue_size: Input queue capacity
        24            # ofmap_queue_size: Output queue capacity per client
    )
    client_opts = ClientOptions(
        True,   # smoothing: Enable FPS smoothing
        30.0    # fps_target: Limit input pacing to 30 FPS
    )
    accl = AsyncAccl(
        dfp_path,
        use_model_shape=(True, True),
        scheduler_options=sched_opts,
        client_options=client_opts
    )
    Cartoonizer(queue, accl, display_thread, src_is_cam=src_is_cam, stop_flag=stop_flag)
    return accl
def run_pose_estimation(queue, dfp_path, pose_post_model, input_shape, display_thread, src_is_cam, frame_limit, stop_flag):
    sched_opts = SchedulerOptions(frame_limit, 0, False, 24, 24)
    client_opts = ClientOptions(True, 30.0)
    accl = AsyncAccl(
        dfp_path,
        use_model_shape=(True, True),
        scheduler_options=sched_opts,
        client_options=client_opts
    )
    PoseEstmiation(queue, accl, display_thread, input_shape, mirror=True,
                   src_is_cam=src_is_cam, stop_flag=stop_flag, post_model=pose_post_model)
    return accl
Create the capture process and the accelerator instances for each application:
capture_proc = Process(
    target=shared_capture_loop,
    args=(input_source, queue_cartoonizer, queue_pose, stop_flag)
)

accl = run_cartoonizer(queue_cartoonizer, dfp_cartoon, displayer.update_left,
                       src_is_cam, args.frame_limit, stop_flag)
accl2 = run_pose_estimation(queue_pose, dfp_pose, pose_post_model, (640, 640),
                            displayer.update_right, src_is_cam, args.frame_limit, stop_flag)
print("[MAIN] Starting capture process and inference threads...")
capture_proc.start()
// Accelerator options
MX::RPC::SchedulerOptions sched_opts{
    30,    // frame_limit: Max frames before DFP swap
    0,     // time_limit (ms): No time-based swap limit
    false, // stop_on_empty: Keep DFP active even if input is empty
    20,    // ifmap_queue_size: Input queue capacity
    20     // ofmap_queue_size: Output queue capacity per client
};
MX::RPC::ClientOptions client_opts{
    true,  // smoothing: Enable FPS smoothing
    30.0f  // fps_target: Limit input pacing to 30 FPS
};
// Cartoon pipeline
std::vector<int> cartoon_device = {0};
MX::Runtime::MxAccl accl_cartoon("Facial_cartoonizer_512_512_3_onnx.dfp", cartoon_device, {false, false}, false, sched_opts, client_opts);
CartoonApp cartoonApp(&accl_cartoon, &cartoon_queue, &cartoon_mutex, &cartoon_cv, &runflag, &gui, 0);
// Pose pipeline
std::vector<int> pose_device = {0};
MX::Runtime::MxAccl accl_pose("YOLO_v8_small_pose_640_640_3_onnx.dfp", pose_device, {true, true}, false, sched_opts, client_opts);
PoseApp poseApp(&accl_pose, "YOLO_v8_small_pose_640_640_3_onnx_post.onnx", &pose_queue, &pose_mutex, &pose_cv, &runflag, &gui, 1);
accl_pose.start();
accl_cartoon.start();
gui.Run();
std::cout << "[Main] GUI exited. Shutting down..." << std::endl;
runflag.store(false);
accl_pose.wait();
accl_cartoon.wait();
accl_pose.stop();
accl_cartoon.stop();
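On the Python side, run.py then typically hands control to the PyQt5 event loop and waits for both accelerators before cleaning up. The exact shutdown sequence in the downloadable code may differ; the following is only a sketch of the idea, where app stands in for the QApplication instance created for the Displayer:

# Sketch of a possible shutdown sequence -- check run.py for the actual logic.
app.exec_()            # run the PyQt5 event loop until the display window closes

stop_flag.set()        # tell the capture process and input callbacks to stop
accl.wait()            # wait for the Cartoonizer accelerator to finish
accl2.wait()           # wait for the Pose Estimation accelerator to finish
capture_proc.join()    # reap the capture process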
Third-Party Licenses#
This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:
Model: Cartoonizer model from
License: CC BY-NC-SA 4.0
Model: YOLOv8s-pose from the Ultralytics GitHub
License: AGPLv3
Code and Pre/Post-Processing: Some code components, including pre/post-processing, were sourced from their GitHub
License: AGPLv3
Summary#
This tutorial showed how to use the Accelerator API to run a multi-DFP application using Cartoonizer and Pose Estimation.