No-Copy C++ Optimization

Note

Be sure to familiarize yourself with the C++ API before trying this optimization.

The C++ Runtime includes an advanced function, get_data_no_copy, that can be used in place of get_data in certain situations. It is useful in low-end systems such as Raspberry Pi, where copying data around in memory can add unwanted latencies.

In this tutorial we will discuss this function in detail and show an example of its usage.

Details

The C++ MxAccl API works with callback functions where users send the input through the FeatureMap object’s set_data function and receive the output through get_data. These functions write copies of the source data into the destination pointer, so that the MxAccl runtime’s data is separate from pointers/memory the user application is working on. This paradigm is used to ensure that users don’t have to worry about the lifetime or validity of pointers for data being sent to/from the MXA.

However, on low-end systems such as some ARM boards, the memcpy function used can be expensive. For users who want to optimize their code as much as possible, we provide the get_data_no_copy function, in which data is not copied from the source FeatureMap to the dest pointer; instead, the dest pointer is assigned to the FeatureMap’s internal data structure. This assignment is much faster than a copy, but the FeatureMap’s data is only valid during the output callback function.

Once your callback function returns, the pointer is no longer guaranteed to point to valid data and should not be dereferenced, modified, or deallocated.
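The difference between the two access styles can be sketched without the runtime. In the block below, MockFeatureMap is a hypothetical stand-in written for this illustration; it is not the MxAccl FeatureMap class, and its method signatures are only assumed from the calls shown in this tutorial. The mock keeps its buffer allocated and merely overwrites it, so the demonstration stays well-defined; the real runtime gives no such guarantee once the callback has returned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a feature map. NOT the MxAccl class:
// it only mimics the copying and non-copying access styles.
class MockFeatureMap {
public:
    MockFeatureMap(std::size_t count, float fill)
        : count_(count), buffer_(new float[count]) {
        std::fill_n(buffer_.get(), count_, fill);
    }

    // Copying style: duplicates the internal buffer into caller-owned memory.
    void get_data(float* dest) const {
        std::memcpy(dest, buffer_.get(), count_ * sizeof(float));
    }

    // No-copy style: only re-points `dest` at the internal buffer.
    void get_data_no_copy(float*& dest) const {
        dest = buffer_.get();
    }

    // Simulates the runtime reusing its buffer after a callback returns.
    void runtime_reuses_buffer(float fill) {
        std::fill_n(buffer_.get(), count_, fill);
    }

private:
    std::size_t count_;
    std::unique_ptr<float[]> buffer_;
};

// Returns true if the copy kept the old value while the alias saw the new one.
bool copy_survives_but_alias_changes() {
    MockFeatureMap fmap(4, 1.0f);

    float copied[4];
    fmap.get_data(copied);           // independent copy of the data

    float* aliased = nullptr;
    fmap.get_data_no_copy(aliased);  // alias into the mock's own buffer
    assert(aliased != nullptr);

    fmap.runtime_reuses_buffer(2.0f);

    return copied[0] == 1.0f && aliased[0] == 2.0f;
}
```

The copied array is unaffected by what happens to the internal buffer afterwards, whereas the aliased pointer observes every later change, which is why it must not be used after the callback returns.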

How to Use

From an API perspective, users just need to use the function get_data_no_copy instead of get_data. However, the surrounding application code will be different.

get_data

For example, here’s some code that uses the get_data function to copy into a user-allocated float array. A running_thread function represents some other thread running in the application, such as a post-processing function or output display.

float* mydata;

void init(){
    mydata = new float[some_shape_here];
}

bool my_output_callback(vector<const MX::Types::FeatureMap*> src, int stream_num){
    // copies the src data (consider only 1 output featuremap) to our pointer
    src[0]->get_data(mydata);

    // exit the callback, indicating to MxAccl that we are done with src
    return true;
}

// some function that's running independently
void running_thread(){
    while(true){
        printf("Our 100th mydata value is currently: %f\n", mydata[99]);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

void exit(){
    delete [] mydata;
}

get_data_no_copy: INCORRECT Usage

If you simply removed the mydata allocation and deallocation from the code above and switched to get_data_no_copy, you would get invalid memory accesses. For example:

float* mydata;

bool my_output_callback(vector<const MX::Types::FeatureMap*> src, int stream_num){
    // **Assigns** the mydata pointer to src's internal data structure
    src[0]->get_data_no_copy(mydata);

    // exit the callback, indicating to MxAccl that we are done with src
    // .. which we ARE NOT, because mydata will still point to it!
    return true;
}

// some function that's running independently
void running_thread(){
    while(true){
        // since mydata points to the src given to my_output_callback at some unknown time,
        // accessing this pointer is UNDEFINED BEHAVIOR
        printf("Our 100th mydata value is currently: %f\n", mydata[99]);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

get_data_no_copy: Correct Usage

However, if your application is structured so that every use of mydata happens before the output callback returns, get_data_no_copy is safe to use.

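float* mydata;

// declared up front because my_output_callback calls it before its definition
void print_100th_value();

bool my_output_callback(vector<const MX::Types::FeatureMap*> src, int stream_num){
    // **Assigns** the mydata pointer to src's internal data structure
    src[0]->get_data_no_copy(mydata);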

    // call our function from within the callback thread
    print_100th_value();

    // exit the callback, indicating to MxAccl that we are done with src
    return true;
}

// this function is now called from my_output_callback, so it is not running in its own thread anymore
void print_100th_value(){
    // we know this is safe because my_output_callback has not yet exited -- mydata is still pointing to valid src data
    printf("Our 100th mydata value is currently: %f\n", mydata[99]);
}

Discussion

The get_data_no_copy function can be useful in applications where neural network post-processing runs inside the callback, and only the post-processed results are then copied elsewhere by your application. It also works when post-processing and displaying the final results both happen inside the callback.
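As a rough sketch of that first pattern, the callback can reduce the raw output to a small result and keep only the result. Everything in the block below is hypothetical: handle_output stands for the body of an output callback, scores stands for the pointer obtained from get_data_no_copy, and the choice of an argmax over class scores is just an example of post-processing.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Small result that outlives the callback; other threads may read it freely.
std::atomic<int> latest_class{-1};

// Example post-processing step: index of the largest score.
int argmax(const float* scores, std::size_t count) {
    assert(count > 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; ++i) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return static_cast<int>(best);
}

// Hypothetical callback body. `scores` represents the no-copy pointer,
// which is only read here, before the function returns.
bool handle_output(const float* scores, std::size_t count) {
    latest_class.store(argmax(scores, count));  // keep the result, not the pointer
    return true;
}
```

Only an int leaves the callback, so nothing outside it ever holds a pointer into the runtime's buffer.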

But if your application uses a separate thread for post-processing or display, or is a multi-model app such as the Face Detect + Emotion multi-model example, you must make copies of the src data to avoid accessing invalid memory.
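One way to hand a copy to another thread is a small mutex-guarded buffer, sketched here under stated assumptions. SharedOutput, publish, and snapshot are hypothetical names invented for this illustration and are not part of the MxAccl API. The callback would call publish before returning, and the consumer thread would only ever call snapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical hand-off buffer between the callback thread and a consumer thread.
class SharedOutput {
public:
    // Called from inside the output callback: copies the data while it is still valid.
    void publish(const float* src, std::size_t count) {
        assert(src != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        values_.assign(src, src + count);
    }

    // Called from the consumer thread: returns its own private copy.
    std::vector<float> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<float> values_;
};
```

The copy inside publish plays the same role as the copy made by get_data; what matters is that the consumer thread reads application-owned memory and never the runtime's buffer. The mutex additionally keeps the consumer from reading while the callback is writing.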

On systems where memcpy overhead is minimal, such as decent x86 CPUs, get_data is therefore the preferred choice: it is safer, and the performance cost of the copy is low.