Cartoonizer and Pose Estimation#
Introduction#
The Multi-DFP example demonstrates how two distinct DFPs (Dataflow Programs)—Cartoonizer and Pose Estimation—can be combined and run as a single application on MemryX accelerators using the Accelerator API. It uses a pre-trained YOLOv8s-pose model and an open-source cartoonizer model for real-time inference.
Note
This tutorial assumes a four-chip solution is correctly connected.
This example demonstrates how to run two applications together efficiently. For standalone usage, refer to their individual examples: Cartoonizer and Pose Estimation.
Download & Run
Download
This tutorial provides a high-level overview of the application’s key components. To run the full application, download the complete code package and the compiled DFP. After downloading, refer to the Run section below for step-by-step instructions.
Run
Requirements
Install the required dependencies:
pip install opencv-python==4.11.0.86
pip install PyQt5==5.15.11
pip install ultralytics==8.3.161
Run Command
Run the Python script:
cd src/python/
python run.py --cam
# Run it with a video:
python run.py --video <video_path>
Requirements
Install the required dependencies and build the project:
# Install memx runtime plugins and utilities libs
sudo apt-get install memx-accl memx-accl-plugins memx-utils-gui
# Install OpenCV (ensure version compatibility)
sudo apt-get install libopencv-dev
# Build the C++ project
cd src/cpp/
mkdir build && cd build
cmake ..
make
Run Command
Run the C++ executable:
./cartoon_pose_app --cam
# Run it with a video:
./cartoon_pose_app --video <video_path>
1. Download the Model#
Pre-trained models for Cartoonizer and YOLOv8s-pose can be found on their respective GitHub pages: Cartoonizer model and Official YOLOv8s-pose.
For the sake of this tutorial, we exported and pre-compiled the models for you to download. They can be found in the compressed folder attached to this tutorial.
2. Compile the Model#
Note
You can use the pre-compiled DFP attached to this tutorial and skip the compilation step. Please make sure to include it in your working folder.
The compilation step is typically needed once and can be done using the Neural Compiler API or Tool.
The YOLOv8s-pose model was exported with the option to include a post-processing section in the model graph.
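If you want to reproduce that export yourself, a minimal sketch using the ultralytics package (already listed in the Python requirements above) is shown below. The exact export options used for the attached model are an assumption.

# Hypothetical export sketch -- the attached ONNX may have been produced with
# different options. Requires the ultralytics package from the requirements.
from ultralytics import YOLO

model = YOLO("yolov8s-pose.pt")         # downloads the official weights if not present
model.export(format="onnx", imgsz=640)  # writes an ONNX graph with the post-processing head included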
To compile the model, run the following in your command line:
mx_nc -v -m YOLO_v8_small_pose_640_640_3_onnx.onnx --autocrop -c 4
This will produce a DFP file ready to be used by the accelerator, along with a cropped post-processing model in ONNX format. In your Python code, point the dfp and post_model variables to the generated file paths.
dfp = "YOLO_v8_small_pose_640_640_3_onnx.dfp"
post_model = "YOLO_v8_small_pose_640_640_3_onnx_post.onnx"
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="YOLO_v8_small_pose_640_640_3_onnx.onnx", verbose=1, autocrop=True)
dfp = nc.run()
Cartoonizer Model
To get the Cartoonizer ONNX model, please follow these steps. After obtaining the ONNX model, you can compile it as shown below.
To compile the model, run the following in your command line:
mx_nc -v -m facial-cartoonizer_512.onnx --autocrop -c 4
This will produce a DFP file ready to be used by the accelerator. In your Python code, point the cartoon_dfp variable to the generated file path.
cartoon_dfp = "facial-cartoonizer_512.dfp"
from memryx import NeuralCompiler
nc = NeuralCompiler(num_chips=4, models="facial-cartoonizer_512.onnx", verbose=1, autocrop=True)
cartoon_dfp = nc.run()
3. CV Initializations#
In this step, we import the needed libraries and initialize the CV pipeline.
import os
import sys
from pathlib import Path
import cv2 as cv
import numpy as np
from collections import deque
import argparse
from multiprocessing import Process, Queue, Event
from memryx import AsyncAccl
from apps import Cartoonizer, PoseEstmiation
from PyQt5.QtWidgets import QApplication
from displayer import Displayer
from memryx.runtime import SchedulerOptions, ClientOptions
This function reads frames from a single camera source and places them into two separate queues, allowing the same frames to be passed concurrently to both the Cartoonizer and Pose Estimation accelerators.
def shared_capture_loop(src, queue1, queue2, stop_flag):
    cap = cv.VideoCapture(src)
    # Check if the source is a camera or video file
    is_cam = isinstance(src, int) or (isinstance(src, str) and src.startswith('/dev/video'))
    while not stop_flag.is_set():
        ret, frame = cap.read()
        if not ret:
            break
        for q in [queue1, queue2]:
            if not q.full():
                q.put(frame.copy())
    cap.release()
    if not is_cam:
        stop_flag.set()
#include <iostream>
#include <thread>
#include <atomic>
#include <mutex>
#include <csignal>
#include <queue>
#include <condition_variable>
#include <opencv2/opencv.hpp>
#include "memx/accl/MxAccl.h"
#include <memx/mxutils/gui_view.h>
#include "PoseApp.h"
#include "CartoonApp.h"
// Frame queues and sync
std::queue<cv::Mat> cartoon_queue;
std::queue<cv::Mat> pose_queue;
std::mutex cartoon_mutex, pose_mutex;
std::condition_variable cartoon_cv, pose_cv;
// Runtime state
std::atomic_bool runflag(true);
cv::VideoCapture vcap;
// Signal handler to safely stop
void signalHandler(int signum) {
    runflag.store(false);
}
const size_t MAX_QUEUE_SIZE = 20; // You can adjust this as needed
This function reads frames from a single camera source and places them into two separate queues, allowing the same frames to be passed concurrently to both the Cartoonizer and Pose Estimation accelerators. If the input is a video file, the application waits until the queues have been fully processed by the accelerators before exiting.
void frameProducer(cv::VideoCapture& cap, std::atomic_bool& runflag, bool is_video) {
    cv::Mat frame;
    while (runflag.load()) {
        if (!cap.read(frame)) {
            std::cerr << "[Producer] End of video or failed to read frame.\n";
            break;
        }
        {
            std::lock_guard<std::mutex> lock(cartoon_mutex);
            cartoon_queue.push(frame.clone());
            cartoon_cv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lock(pose_mutex);
            pose_queue.push(frame.clone());
            pose_cv.notify_one();
        }
    }
    if (is_video) {
        std::cout << "[Producer] Waiting for queues to flush before exiting...\n";
        while (runflag.load()) {
            std::unique_lock<std::mutex> lock1(cartoon_mutex, std::defer_lock);
            std::unique_lock<std::mutex> lock2(pose_mutex, std::defer_lock);
            std::lock(lock1, lock2);
            if (cartoon_queue.empty() && pose_queue.empty())
                break;
            lock1.unlock();
            lock2.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        std::cout << "[Producer] Flush wait done or interrupted. Proceeding to shutdown...\n";
    }
    runflag.store(false);
    cartoon_cv.notify_all();
    pose_cv.notify_all();
}
4. Define an Input Function#
We need to define an input function for each accelerator (Cartoonizer and Pose). In this case, the input function retrieves a new frame from the queue, pre-processes it, and returns the image to the accelerator.
Cartoonizer Input Callback
def get_frame(self):
    while True:
        try:
            frame = self.frame_queue.get(timeout=5)
        except queue.Empty:
            return None
        if self.src_is_cam and self.capture_queue.full():
            continue
        if self.input_height is None or self.input_width is None:
            self.input_height = int(frame.shape[0] * self.scale)
            self.input_width = int(frame.shape[1] * self.scale)
        if self.mirror:
            frame = cv.flip(frame, 1)
        self.capture_queue.put(frame)
        return self.preprocess(frame)
Pose-Estimation Input Callback
def generate_frame(self):
    while True:
        try:
            frame = self.frame_queue.get(timeout=5)
        except queue.Empty:
            return None
        if self.src_is_cam and self.capture_queue.full():
            continue
        if self.input_height is None or self.input_width is None:
            self.input_height = int(frame.shape[0])
            self.input_width = int(frame.shape[1])
        if self.mirror:
            frame = cv.flip(frame, 1)
        self.capture_queue.put(frame)
        out, self.ratio = self.preprocess_image(frame)
        return out
Cartoonizer Input Callback
bool CartoonApp::incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel)
{
    if (!runflag->load()) return false;
    cv::Mat inframe;
    {
        std::unique_lock<std::mutex> lock(*queue_mutex);
        queue_cv->wait(lock, [&]() { return !frame_queue->empty() || !runflag->load(); });
        if (!runflag->load()) return false;
        inframe = frame_queue->front().clone();
        frame_queue->pop();
    }
    cv::resize(inframe, img_resized, img_resized.size());
    cv::cvtColor(img_resized, img_resized, cv::COLOR_BGR2RGB);
    img_resized.convertTo(img_model_in, CV_32FC3, 1.0 / 127.5, -1.0);
    dst[0]->set_data((float*)img_model_in.data);
    return true;
}
Pose-Estimation Input Callback
bool PoseApp::incallback_getframe(std::vector<const MX::Types::FeatureMap*> dst, int streamLabel) {
    if (!runflag->load()) return false;
    cv::Mat inframe;
    {
        std::unique_lock<std::mutex> lock(*queue_mutex);
        queue_cv->wait(lock, [&]() { return !frame_queue->empty() || !runflag->load(); });
        if (!runflag->load()) return false;
        inframe = frame_queue->front().clone();
        frame_queue->pop();
    }
    {
        std::lock_guard<std::mutex> qlock(frameQueue_mutex);
        frames_queue.push_back(inframe.clone());
    }
    cv::Mat rgb, resized, floatImg;
    cv::cvtColor(inframe, rgb, cv::COLOR_BGR2RGB);
    cv::dnn::blobFromImage(rgb, resized, 1.0, cv::Size(model_input_width, model_input_height), cv::Scalar(0, 0, 0), true, false);
    resized.convertTo(floatImg, CV_32F, 1.0 / 255.0);
    dst[0]->set_data((float*)floatImg.data);
    return true;
}
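The preprocess_image helper called in the Python pose callback is defined in apps.py and is not reproduced here. As a rough illustration only, a typical YOLOv8-style preprocessing routine that scales the frame to the model input size and returns the scale ratio used later to map detections back could look like this sketch (the shipped implementation may differ):

import cv2 as cv
import numpy as np

def preprocess_image(frame, input_width=640, input_height=640):
    # Illustrative sketch only -- not the implementation shipped in apps.py.
    # Scale the frame so it fits the model input, pad the remainder with zeros,
    # and return the scale ratio needed to map boxes/keypoints back later.
    ratio = min(input_width / frame.shape[1], input_height / frame.shape[0])
    resized = cv.resize(frame, (int(frame.shape[1] * ratio), int(frame.shape[0] * ratio)))
    padded = np.zeros((input_height, input_width, 3), dtype=np.uint8)
    padded[:resized.shape[0], :resized.shape[1]] = resized
    rgb = cv.cvtColor(padded, cv.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0, ratio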
5. Define an Output Function#
We also need to define an output function for each accelerator. The output callback function will post-process the accelerator’s output and display it on the screen.
Cartoonizer Output Callback
def process_model_output(self, *ofmaps):
    img = self.postprocess(ofmaps[0], (self.input_width, self.input_height))
    self.display(img)
    self.capture_queue.get()
    return img
Pose-Estimation Output Callback
def process_model_output(self, *ofmaps):
    predict = ofmaps[0].squeeze(0).T
    predict = predict[predict[:, 4] > self.box_score]
    scores = predict[:, 4]
    boxes = self.xywh2xyxy(predict[:, 0:4] / self.ratio)
    kpts = predict[:, 5:]
    for i in range(kpts.shape[0]):
        for j in range(kpts.shape[1] // 3):
            if kpts[i, 3*j+2] < self.kpt_score:
                kpts[i, 3*j:3*(j+1)] = [-1, -1, -1]
            else:
                kpts[i, 3*j] /= self.ratio
                kpts[i, 3*j+1] /= self.ratio
    idxes = self.nms_process(boxes, scores, self.nms_thr)
    img = self.capture_queue.get()
    self.capture_queue.task_done()
    for kpt in np.array(kpts)[idxes]:
        for pair in self.KEYPOINT_PAIRS:
            pt1 = kpt[3 * pair[0]: 3 * (pair[0] + 1)]
            pt2 = kpt[3 * pair[1]: 3 * (pair[1] + 1)]
            if pt1[2] > 0 and pt2[2] > 0:
                cv.line(img, (int(pt1[0]), int(pt1[1])), (int(pt2[0]), int(pt2[1])), (255, 255, 255), 3)
        for idx in range(len(kpt) // 3):
            x, y, score = kpt[3*idx:3*(idx+1)]
            if score > 0:
                cv.circle(img, (int(x), int(y)), 5, self.COLOR_LIST[idx % len(self.COLOR_LIST)], -1)
    self.show(img)
Cartoonizer Output Callback
bool CartoonApp::outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel)
{
    src[0]->get_data((float*)img_model_out.data);
    // Convert [-1,1] → [0,255] safely
    cv::add(img_model_out, cv::Scalar(1.0f, 1.0f, 1.0f), img_model_out);
    cv::min(cv::max(img_model_out, 0.0f), 2.0f, img_model_out);
    img_model_out.convertTo(img_model_out_uint, CV_8UC3, 127.5);
    // Resize for display and show
    cv::resize(img_model_out_uint, img_final_out_resized, displaySize);
    gui_->screens[0]->SetDisplayFrame(gui_stream_idx_, &img_final_out_resized);
    return true;
}
Pose-Estimation Output Callback
bool PoseApp::outcallback_getmxaoutput(std::vector<const MX::Types::FeatureMap*> src, int streamLabel) {
    for (int i = 0; i < post_model_info.num_out_featuremaps; ++i)
        src[i]->get_data(ofmap[i]);
    cv::Mat inframe;
    {
        std::lock_guard<std::mutex> qlock(frameQueue_mutex);
        inframe = frames_queue.front().clone();
        frames_queue.pop_front();
    }
    float x_factor = inframe.cols / float(model_input_width);
    float y_factor = inframe.rows / float(model_input_height);
    std::vector<Box> boxes;
    std::vector<cv::Rect> rects;
    std::vector<float> scores;
    for (int i = 0; i < dets_length; ++i) {
        float conf = ofmap[0][4 * dets_length + i];
        if (conf < box_score) continue;
        float x = ofmap[0][0 * dets_length + i] * x_factor;
        float y = ofmap[0][1 * dets_length + i] * y_factor;
        float w = ofmap[0][2 * dets_length + i] * x_factor;
        float h = ofmap[0][3 * dets_length + i] * y_factor;
        int x1 = int(x - w / 2);
        int y1 = int(y - h / 2);
        int x2 = int(x + w / 2);
        int y2 = int(y + h / 2);
        Box box;
        box.confidence = conf;
        for (int j = 0; j < num_kpts; ++j) {
            float kx = ofmap[0][(5 + 3 * j + 0) * dets_length + i] * x_factor;
            float ky = ofmap[0][(5 + 3 * j + 1) * dets_length + i] * y_factor;
            float kc = ofmap[0][(5 + 3 * j + 2) * dets_length + i];
            box.keypoints.emplace_back(kc > kpt_score ? std::make_pair(kx, ky) : std::make_pair(-1.f, -1.f));
        }
        boxes.push_back(box);
        rects.emplace_back(x1, y1, x2 - x1, y2 - y1);
        scores.push_back(conf);
    }
    std::vector<int> keep;
    cv::dnn::NMSBoxes(rects, scores, box_score, nms_thr, keep);
    for (int idx : keep) {
        const auto& box = boxes[idx];
        for (const auto& pair : KEYPOINT_PAIRS) {
            auto p1 = box.keypoints[pair.first];
            auto p2 = box.keypoints[pair.second];
            if (p1.first != -1 && p2.first != -1)
                cv::line(inframe, {int(p1.first), int(p1.second)}, {int(p2.first), int(p2.second)}, {255,255,255}, 2);
        }
        for (int i = 0; i < box.keypoints.size(); ++i) {
            auto& kpt = box.keypoints[i];
            if (kpt.first != -1)
                cv::circle(inframe, {int(kpt.first), int(kpt.second)}, 3, COLOR_LIST[i % COLOR_LIST.size()], -1);
        }
    }
    cv::Mat bgrFrame;
    cv::cvtColor(inframe, bgrFrame, cv::COLOR_RGB2BGR);
    gui_->screens[0]->SetDisplayFrame(gui_stream_idx_, &bgrFrame);
    return true;
}
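The nms_process helper used in the Python pose callback also lives in apps.py; the C++ version above delegates the same job to cv::dnn::NMSBoxes. As an illustration (not necessarily the shipped implementation), a pure-NumPy greedy non-maximum suppression over xyxy boxes could look like:

import numpy as np

def nms_process(boxes, scores, nms_thr):
    # Illustrative greedy NMS over (N, 4) xyxy boxes; returns the kept indices.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box above the threshold
        order = order[1:][iou < nms_thr]
    return keep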
6. Connect the Accelerator#
We create separate accl objects for each application and connect the corresponding input and output functions to the AsyncAccl API. The API handles the rest of the processing pipeline.
Each application—Cartoonizer and Pose Estimation—has its own setup function that creates and configures an AsyncAccl instance with two optional configurations:
SchedulerOptions: Controls how many frames are processed, the sizes of input/output queues, and whether the accelerator should stop if the input queue becomes empty.
ClientOptions: Enables FPS smoothing and sets a target input frame rate (e.g., 30 FPS), helping maintain consistent processing speed.
These options help manage performance and ensure that both accelerators run smoothly when used together. For more information, see the Accelerator API documentation, particularly the Runtime Scheduler Configuration section.
Each application has its own AsyncAccl instance:
def run_cartoonizer(queue, dfp_path, display_thread, src_is_cam, frame_limit, stop_flag):
    sched_opts = SchedulerOptions(
        frame_limit,  # frame_limit: Max frames before DFP swap
        0,            # time_limit (ms): No time-based swap limit
        False,        # stop_on_empty: Keep DFP active even if input is empty
        24,           # ifmap_queue_size: Input queue capacity
        24            # ofmap_queue_size: Output queue capacity per client
    )
    client_opts = ClientOptions(
        True,   # smoothing: Enable FPS smoothing
        30.0    # fps_target: Limit input pacing to 30 FPS
    )
    accl = AsyncAccl(
        dfp_path,
        use_model_shape=(True, True),
        scheduler_options=sched_opts,
        client_options=client_opts
    )
    Cartoonizer(queue, accl, display_thread, src_is_cam=src_is_cam, stop_flag=stop_flag)
    return accl
def run_pose_estimation(queue, dfp_path, pose_post_model, input_shape, display_thread, src_is_cam, frame_limit, stop_flag):
    sched_opts = SchedulerOptions(frame_limit, 0, False, 24, 24)
    client_opts = ClientOptions(True, 30.0)
    accl = AsyncAccl(
        dfp_path,
        use_model_shape=(True, True),
        scheduler_options=sched_opts,
        client_options=client_opts
    )
    PoseEstmiation(queue, accl, display_thread, input_shape, mirror=True,
                   src_is_cam=src_is_cam, stop_flag=stop_flag, post_model=pose_post_model)
    return accl
Create the capture process and the accelerator instances for each application:
capture_proc = Process(
    target=shared_capture_loop,
    args=(input_source, queue_cartoonizer, queue_pose, stop_flag)
)

accl = run_cartoonizer(queue_cartoonizer, dfp_cartoon, displayer.update_left,
                       src_is_cam, args.frame_limit, stop_flag)
accl2 = run_pose_estimation(queue_pose, dfp_pose, pose_post_model, (640, 640),
                            displayer.update_right, src_is_cam, args.frame_limit, stop_flag)
print("[MAIN] Starting capture process and inference threads...")
capture_proc.start()
// Accelerator options
MX::RPC::SchedulerOptions sched_opts{
    30,    // frame_limit: Max frames before DFP swap
    0,     // time_limit (ms): No time-based swap limit
    false, // stop_on_empty: Keep DFP active even if input is empty
    20,    // ifmap_queue_size: Input queue capacity
    20     // ofmap_queue_size: Output queue capacity per client
};
MX::RPC::ClientOptions client_opts{
    true,  // smoothing: Enable FPS smoothing
    30.0f  // fps_target: Limit input pacing to 30 FPS
};
// Cartoon pipeline
std::vector<int> cartoon_device = {0};
MX::Runtime::MxAccl accl_cartoon("Facial_cartoonizer_512_512_3_onnx.dfp", cartoon_device, {false, false}, false, sched_opts, client_opts);
CartoonApp cartoonApp(&accl_cartoon, &cartoon_queue, &cartoon_mutex, &cartoon_cv, &runflag, &gui, 0);
// Pose pipeline
std::vector<int> pose_device = {0};
MX::Runtime::MxAccl accl_pose("YOLO_v8_small_pose_640_640_3_onnx.dfp", pose_device, {true, true}, false, sched_opts, client_opts);
PoseApp poseApp(&accl_pose, "YOLO_v8_small_pose_640_640_3_onnx_post.onnx", &pose_queue, &pose_mutex, &pose_cv, &runflag, &gui, 1);
accl_pose.start();
accl_cartoon.start();
gui.Run();
std::cout << "[Main] GUI exited. Shutting down..." << std::endl;
runflag.store(false);
accl_pose.wait();
accl_cartoon.wait();
accl_pose.stop();
accl_cartoon.stop();
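On the Python side, run.py then typically hands control to the PyQt5 event loop and waits for both accelerators before cleaning up. The exact shutdown sequence in the downloadable code may differ; the following is only a sketch of the idea, where app stands in for the QApplication instance created for the Displayer:

# Sketch of a possible shutdown sequence -- check run.py for the actual logic.
app.exec_()            # run the PyQt5 event loop until the display window closes

stop_flag.set()        # tell the capture process and input callbacks to stop
accl.wait()            # wait for the Cartoonizer accelerator to finish
accl2.wait()           # wait for the Pose Estimation accelerator to finish
capture_proc.join()    # reap the capture process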
Third-Party Licenses#
This tutorial uses third-party software, models, and libraries. Below are the details of the licenses for these dependencies:
Model: Cartoonizer model from
License: CC BY-NC-SA 4.0
Model: YOLOv8s-pose from the Ultralytics GitHub
License: AGPLv3
Code and Pre/Post-Processing: Some code components, including pre/post-processing, were sourced from their GitHub
License: AGPLv3
Summary#
This tutorial showed how to use the Accelerator API to run a multi-DFP application using Cartoonizer and Pose Estimation.