Driver Usage#

Introduction#

In this tutorial, we provide more examples of running inference with different model and device combinations. Be sure to read Classification using Driver APIs first for basic driver knowledge.

Note

OpenCV is used in the following examples but is not included in the package. Please refer to the official OpenCV website for more information.

Before we look into the following examples, it is important to first understand the different types of DFP files. In some cases, runtime model swapping is required to avoid hardware resource access conflicts.

Each example is constructed in three parts: the run-inference sub-routine, run inference by model, and the application main. Descriptions are provided for each example, and the source code can be downloaded from Resources.

  • A single model is compiled into one DFP file. This is the most common and basic scenario: we run a single model's inference on one accelerator (cascaded or not). No runtime model swapping is required. See Basic Inference.

  • Multiple models are compiled into the same DFP file and coexist. Two or more models are compiled with no hardware resources overlapping. In other words, these models can run inference simultaneously on the same accelerator as long as different input and output ports (flows) are given. No runtime model swapping is required. See Example 2: Multiple Models Coexist.

  • Multiple models are compiled into different DFP files. Two or more models are compiled separately into multiple DFP files. Both weight memory re-configuration and model swapping are required at runtime. See Example 3: Runtime Model Swap.

Another common strategy is to use multiple accelerators (cascaded or not) to run inference in parallel and reduce overall inference time and latency.

  • Multiple models with runtime device selection. Dynamically select whichever device is available at runtime. This example can be viewed as an extension of Example 3: Runtime Model Swap. Here we demonstrate how to run two models on two accelerators separately with runtime device selection. See Example 4: Runtime Device Selection.

Example 2: Multiple Models Coexist#

2.1 Run Inference Sub-routine#

// Since the two models share the same procedure to run inference, we make it a
// common sub-routine with a parameter structure here.
typedef struct  {
  uint8_t model_id; // model ID
  uint8_t group_id; // MPU device group ID
  uint8_t iport; // input port ID
  void* ifmap; // input feature map
  uint8_t oport; // output port ID
  void* ofmap; // output feature map
} RunInferenceConfig;

// Simple frame-in-frame-out inference. The two models can run inference
// simultaneously because they use different ports.
memx_status run_inference(RunInferenceConfig* config)
{
  memx_status status = MEMX_STATUS_OK;
  const int timeout = 200; // 200 ms

  // 1. Enable data transfer of both models to device.
  if (memx_status_no_error(status)) {
    status = memx_set_stream_enable(config->model_id, 0);
  }
  // 2. Write input feature map to device to run inference
  if (memx_status_no_error(status)) {
    status = memx_stream_ifmap(config->model_id,
      config->iport, config->ifmap, timeout);
  }
  // 3. Read output feature map from device after inference
  if (memx_status_no_error(status)) {
    status = memx_stream_ofmap(config->model_id,
      config->oport, config->ofmap, timeout);
  }
  // 4. Disable data transfer of this model to device.
  if (memx_status_no_error(status)) {
    status = memx_set_stream_disable(config->model_id, 0);
  }

  return status;
}
class RunInferenceConfig:
  '''Since the two models share the same procedure to run inference, we make it a
    common sub-routine with a parameter structure here.'''
  def __init__(self):
    self.model_id = 0 # model ID
    self.group_id = 0 # MPU device group ID
    self.iport = 0 # input port ID
    self.ifmap = None # input feature map
    self.oport = 0 # output port ID
    self.ofmap = None # output feature map

def run_inference(config: RunInferenceConfig) -> int:
  '''Simple frame-in-frame-out inference. The two models can run inference
    simultaneously because they use different ports.'''
  err = 0

  # 1. Enable data transfer of both models to device.
  if not err:
    err = mxa.set_stream_enable(config.model_id, 0)  
  # 2. Write input feature map to device to run inference
  if not err:
    err = mxa.stream_ifmap(config.model_id,
      config.iport, config.ifmap, timeout=200)
  # 3. Read output feature map from device after inference
  if not err:
    err = mxa.stream_ofmap(config.model_id,
      config.oport, config.ofmap, timeout=200)
  # 4. Disable data transfer of this model to device.
  if not err:
    err = mxa.set_stream_disable(config.model_id, 0)

  return err

Since both models share the same inference procedure, we make it a common sub-routine; the different parameters are set up later to run each model's inference.

2.2 Run Inference By Model#

// Model 0 runs inference; this sub-routine runs in a background thread
void* run_inference_model_0(void* arg)
{
  // Assumes input feature map uses only flow 0 as format float32(224,224,3)
  float* ifmap;
  // Assumes output feature map uses only flow 0 as format float32(1,1,1000)
  float ofmap[1*1*1000]; // allocate memory space

  // 1. Pre-process input feature map
  cv::Mat img = cv::imread("image.png", cv::IMREAD_COLOR);
  cv::resize(img, img, cv::Size(224,224), 0, 0, cv::INTER_LINEAR);
  img.convertTo(img, CV_32F, 1.0/127.5, -1);
  ifmap = (float*)img.data;

  // 2. Run inference setup
  RunInferenceConfig config;
  config.model_id = MODEL_ID; // model 0
  config.group_id = GROUP_ID; // device 0
  config.iport = 0; // input port 0 (flow 0)
  config.ifmap = ifmap; // input feature map
  config.oport = 0; // output port 0 (flow 0)
  config.ofmap = ofmap; // output feature map

  // 3. Run inference common sub-routine
  memx_status status = run_inference(&config);

  // 4. Post-process output feature map
  if (memx_status_no_error(status)) {
    int argmax = 0;
    for (int i = 1; i < 1000; ++i) {
      argmax = (ofmap[i] > ofmap[argmax]) ? i : argmax;
    }
    printf(" - Model 0 argmax = %d\n", argmax);
  } else {
    printf(" - Model 0 failed to run inference = %d\n", status);
  }

  return NULL;
}
def run_inference_model_0() -> None:
  '''Model 0 runs inference; this sub-routine runs in a background thread'''
  # Assumes input feature map uses only flow 0 as format float32(224,224,3)
  ifmap = None
  # Assumes output feature map uses only flow 0 as format float32(1,1,1000)
  ofmap = np.zeros((1,1,1000), dtype=np.float32) # allocate memory space

  # 1. Pre-process input feature map
  img = cv2.imread(r"image.png")
  img = cv2.resize(img, (224,224), interpolation=cv2.INTER_LINEAR)
  ifmap = img.astype(np.float32) / 127.5 - 1 # type(ifmap) = <class 'numpy.ndarray'>

  # 2. Run inference setup
  config = RunInferenceConfig()
  config.model_id = MODEL_ID # model 0
  config.group_id = GROUP_ID # device 0
  config.iport = 0 # input port 0 (flow 0)
  config.ifmap = ifmap # input feature map
  config.oport = 0 # output port 0 (flow 0)
  config.ofmap = ofmap # output feature map

  # 3. Run inference common sub-routine
  err = run_inference(config)

  # 4. Post-process output feature map
  if not err:
    ofmap = ofmap.reshape(1000) # reshape to single dimension
    argmax = np.argmax(ofmap)
    print(" - Model 0 argmax =", argmax)
  else:
    print(" - Model 0 failed to run inference =", err)

Now that we have our common inference sub-routine, we implement the full inference sub-routine, including pre-processing and post-processing, for each model separately. Because the pre-processing and post-processing of the two models are similar, we show only one sub-routine (model 0) here.

In this example, the two models are compiled together into the same DFP file with no hardware resources overlapping. That means we can view these two models as one big model with two flows.

  RunInferenceConfig config;
  config.model_id = MODEL_ID; // same model context as model 0
  config.group_id = GROUP_ID; // device 0
  config.iport = 1; // input port 1 (flow 1)
  config.ifmap = ifmap; // input feature map
  config.oport = 1; // output port 1 (flow 1)
  config.ofmap = ofmap; // output feature map
  config = RunInferenceConfig()
  config.model_id = MODEL_ID # same model context as model 0
  config.group_id = GROUP_ID # device 0
  config.iport = 1 # input port 1 (flow 1)
  config.ifmap = ifmap # input feature map
  config.oport = 1 # output port 1 (flow 1)
  config.ofmap = ofmap # output feature map

The only thing to notice is that we still use a single model context instead of two, with different input and output port setups representing the different models.

Note

In fact, we could simply run the big model's inference as we did in Basic Inference, with only the input and output port setup changed. Here we just provide an example of running the two models' inference in parallel; a single-threaded variant that treats them as one big model is sketched below.
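
As an illustration of this note, the following sketch (our assumption, not part of the downloadable example) treats the two flows as one big model and runs both inferences sequentially in a single thread. It assumes ifmap_0 and ifmap_1 have already been pre-processed as in 2.2 Run Inference By Model and reuses the same MODEL_ID and output shapes.

def run_inference_big_model(ifmap_0, ifmap_1) -> int:
  '''Minimal sketch: run both flows of the "big model" sequentially in one thread.'''
  ofmap_0 = np.zeros((1,1,1000), dtype=np.float32) # flow 0 output
  ofmap_1 = np.zeros((1,1,1000), dtype=np.float32) # flow 1 output

  # Enable streaming once for the whole model context
  err = mxa.set_stream_enable(MODEL_ID, 0)
  # Flow 0 (model 0): write input feature map, then read output feature map
  if not err:
    err = mxa.stream_ifmap(MODEL_ID, 0, ifmap_0, timeout=200)
  if not err:
    err = mxa.stream_ofmap(MODEL_ID, 0, ofmap_0, timeout=200)
  # Flow 1 (model 1): same calls, different ports
  if not err:
    err = mxa.stream_ifmap(MODEL_ID, 1, ifmap_1, timeout=200)
  if not err:
    err = mxa.stream_ofmap(MODEL_ID, 1, ofmap_1, timeout=200)
  # Disable streaming before returning
  if not err:
    err = mxa.set_stream_disable(MODEL_ID, 0)
  return err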

2.3 Application Main#

// Main process, create two threads to run inferences in parallel.
int main(void) {
  memx_status status = MEMX_STATUS_OK;
  pthread_t t0, t1;

  // 1. Bind MPU device group 0 as MX3:Cascade to model.
  if (memx_status_no_error(status)) {
    status = memx_open(MODEL_ID, GROUP_ID, MEMX_DEVICE_CASCADE);
  }

  // 2. Download weight memory and model to device. Because two models are
  // compiled together in one DFP file and coexist with no hardware resources
  // overlapped, we only need to download to device once.
  if (memx_status_no_error(status)) {
    status = memx_download_model(MODEL_ID,
      "models/mobilenet_v1_v2.dfp", 0, // model_idx = 0
      MEMX_DOWNLOAD_TYPE_WTMEM_AND_MODEL);
  }

  // 3. Run two models simultaneously using posix threads (Linux only)
  if (memx_status_no_error(status)) {
    if ((pthread_create(&t0, NULL, &run_inference_model_0, NULL) != 0)
      ||(pthread_create(&t1, NULL, &run_inference_model_1, NULL) != 0)) {
      status = MEMX_STATUS_OTHERS;
    }
  }
  if (memx_status_no_error(status)) {
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
  }

  // 4. Always remember to clean-up resources before leaving.
  memx_close(MODEL_ID);

  // End of process
  if (memx_status_no_error(status)) {
    printf("success.\n");
  } else {
    printf("failure.\n");
  }
  return 0;
}
def main():
  '''Main process, create two threads to run inferences in parallel.'''
  err = 0

  # 1. Bind MPU device group 0 as MX3:Cascade to model.
  if not err:
    err = mxa.open(MODEL_ID, GROUP_ID, 3) # 3 = MX3:Cascade

  # 2. Download weight memory and model to device. Because two models are
  # compiled together in one DFP file and coexist with no hardware resources
  # overlapped, we only need to download to device once.
  if not err:
    err = mxa.download(MODEL_ID, r"models/mobilenet_v1_v2.dfp", 0, # model_idx = 0
      mxa.download_type_wtmem_and_model)

  # 3. Run two models simultaneously using threads
  if not err:
    t0 = threading.Thread(target=run_inference_model_0, args=())
    t1 = threading.Thread(target=run_inference_model_1, args=())
    t0.start()
    t1.start()
    t0.join()
    t1.join()

  # 4. Always remember to clean-up resources before leaving.
  mxa.close(MODEL_ID)

  # End of process
  if not err:
    print("success.")
  else:
    print("failure.")

if __name__ == "__main__":
  main()

As mentioned above, we put the model download in main() since the DFP file only needs to be downloaded once to run both models' inference. Here we use two threads to demonstrate how to run the inferences in parallel.

2.4 Execution Result#

$ mkdir build && cd build && cmake .. && make && cd ..
$ sudo ./build/memx_c_example2
 - Model 0 argmax = 92
 - Model 1 argmax = 284
success.
$ sudo python3 memx_py_example2.py
 - Model 0 argmax = 92
 - Model 1 argmax = 284
success.

The output messages on the terminal show that the highest-score class indexes are 92 and 284, which correspond to the labels bee_eater and siamese_cat. These results are correct for the dataset used to train the two example models.
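
The mapping from class index to label depends on how the example models were trained. As a purely hypothetical illustration (the labels file name below is our assumption and is not shipped with the example), an argmax index can be mapped back to a human-readable label like this:

def argmax_to_label(argmax: int, labels_path: str = "imagenet_labels.txt") -> str:
  '''Hypothetical helper: look up the class label for an argmax index.'''
  with open(labels_path) as f:
    labels = [line.strip() for line in f]
  return labels[argmax] # e.g. 92 -> "bee_eater", 284 -> "siamese_cat"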

Note

In this example, we use CMake to build the source code. If a runtime failure occurs, check the static paths given in the source code and modify them to fit your environment setup before giving this example a try.

Example 3: Runtime Model Swap#

3.1 Run Inference Sub-routine#

// Since the two models share the same procedure to run inference, we make it a
// common sub-routine with a parameter structure here.
typedef struct  {
  uint8_t model_id; // model ID
  uint8_t group_id; // MPU device group ID
  const char* dfp_path; // DFP file path
  uint8_t iport; // input port ID
  void* ifmap; // input feature map
  uint8_t oport; // output port ID
  void* ofmap; // output feature map
} RunInferenceConfig;

// In order to guarantee that only one model can access the device in the
// multi-threading case, we acquire the `device lock` from the driver before
// any hardware configuration.
memx_status run_inference(RunInferenceConfig* config)
{
  const int timeout = 200; // 200 ms

  // 1. Get lock first before hardware configuration. Block and wait here
  // until lock is acquired. Remember to lock 'group_id' instead of
  // 'model_id' since we are trying to lock hardware resource.
  memx_lock(config->group_id);

  // 2. Download weight memory and model to device.
  memx_status status = memx_download_model(config->model_id,
    config->dfp_path, 0, MEMX_DOWNLOAD_TYPE_WTMEM_AND_MODEL);
  // 3. Enable data transfer of this model to device.
  if (memx_status_no_error(status)) {
    status = memx_set_stream_enable(config->model_id, 0);
  }
  // 4. Write input feature map to device to run inference
  if (memx_status_no_error(status)) {
    status = memx_stream_ifmap(config->model_id,
      config->iport, config->ifmap, timeout);
  }
  // 5. Read output feature map from device after inference
  if (memx_status_no_error(status)) {
    status = memx_stream_ofmap(config->model_id,
      config->oport, config->ofmap, timeout);
  }
  // 6. Disable data transfer of this model to device.
  if (memx_status_no_error(status)) {
    // wait to stop may take some time, but is safe
    status = memx_set_stream_disable(config->model_id, 1);
  }

  // 7. Always remember to release lock finally.
  memx_unlock(config->group_id);

  return status;
}
class RunInferenceConfig:
  '''Since the two models share the same procedure to run inference, we make it a
    common sub-routine with a parameter structure here.'''
  def __init__(self):
    self.model_id = 0 # model ID
    self.group_id = 0 # MPU device group ID
    self.dfp_path = None # DFP file path
    self.iport = 0 # input port ID
    self.ifmap = None # input feature map
    self.oport = 0 # output port ID
    self.ofmap = None # output feature map

def run_inference(config: RunInferenceConfig) -> int:
  '''In order to guarantee that only one model can access the device in the
    multi-threading case, we acquire the `device lock` from the driver before
    any hardware configuration.'''
  timeout = 200 # 200 ms

  # 1. Get lock first before hardware configuration. Block and wait here
  # until lock is acquired. Remember to lock 'group_id' instead of
  # 'model_id' since we are trying to lock hardware resource.
  mxa.lock(config.group_id)

  # 2. Download weight memory and model to device.
  err = mxa.download(config.model_id, config.dfp_path, 0,
    mxa.download_type_wtmem_and_model)
  # 3. Enable data transfer of this model to device.
  if not err:
    err = mxa.set_stream_enable(config.model_id, 0)
  # 4. Write input feature map to device to run inference
  if not err:
    err = mxa.stream_ifmap(config.model_id,
      config.iport, config.ifmap, timeout=timeout)
  # 5. Read output feature map from device after inference
  if not err:
    err = mxa.stream_ofmap(config.model_id,
      config.oport, config.ofmap, timeout=timeout)
  # 6. Disable data transfer of this model to device.
  if not err:
    # wait to stop may take some time, but is safe
    err = mxa.set_stream_disable(config.model_id, 1)

  # 7. Always remember to release lock finally.
  mxa.unlock(config.group_id)

  return err

The most important part of run_inference() in this example is the usage of lock() and unlock().

  // 1. Get lock first before hardware configuration. Block and wait here
  // until lock is acquired. Remember to lock 'group_id' instead of
  // 'model_id' since we are trying to lock hardware resource.
  memx_lock(config->group_id);

  // 7. Always remember to release lock finally.
  memx_unlock(config->group_id);
  # 1. Get lock first before hardware configuration. Block and wait here
  # until lock is acquired. Remember to lock 'group_id' instead of
  # 'model_id' since we are trying to lock hardware resource.
  mxa.lock(config.group_id)

  # 7. Always remember to release lock finally.
  mxa.unlock(config.group_id)

Acquiring the lock from the driver ensures that only one model at a time can access the device through the interface. This prevents models from unexpectedly re-configuring the device while another model is using it, or from taking each other's inference data. After the lock is acquired in this example, we can re-configure the device by downloading the weight memory and model, and then run inference.

Note

Always remember to unlock at the end of the sub-routine, otherwise other sub-routines may block and wait forever. Also, be careful not to release another model's lock.
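
One way to make the unlock unconditional is to wrap the locked section in try/finally. The sketch below is our suggestion rather than part of the downloadable example; it reuses the same mxa calls and RunInferenceConfig fields as 3.1:

def run_inference_guarded(config: RunInferenceConfig) -> int:
  '''Same flow as 3.1, but the device lock is released on every path.'''
  mxa.lock(config.group_id) # block until the hardware lock is acquired
  try:
    # Download weight memory and model, then stream one frame
    err = mxa.download(config.model_id, config.dfp_path, 0,
      mxa.download_type_wtmem_and_model)
    if not err:
      err = mxa.set_stream_enable(config.model_id, 0)
    if not err:
      err = mxa.stream_ifmap(config.model_id,
        config.iport, config.ifmap, timeout=200)
    if not err:
      err = mxa.stream_ofmap(config.model_id,
        config.oport, config.ofmap, timeout=200)
    if not err:
      err = mxa.set_stream_disable(config.model_id, 1)
  finally:
    mxa.unlock(config.group_id) # released even if an exception is raised
  return err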

3.2 Run Inference By Model#

  RunInferenceConfig config;
  config.model_id = MODEL_1_ID; // model 1
  config.group_id = GROUP_ID; // device 0
  config.dfp_path = "models/mobilenet_v2.dfp";
  config.iport = 0; // input port 0 (flow 0)
  config.ifmap = ifmap; // input feature map
  config.oport = 0; // output port 0 (flow 0)
  config.ofmap = ofmap; // output feature map
  # 2. Run inference setup
  config = RunInferenceConfig()
  config.model_id = MODEL_1_ID # model 1
  config.group_id = GROUP_ID # device 0
  config.dfp_path = r"models/mobilenet_v2.dfp"
  config.iport = 0 # input port 0 (flow 0)
  config.ifmap = ifmap # input feature map
  config.oport = 0 # output port 0 (flow 0)
  config.ofmap = ofmap # output feature map

The inference sub-routine is basically the same as in 2.2 Run Inference By Model; only the setup is slightly different.

3.3 Application Main#

int main(void) {
  memx_status status = MEMX_STATUS_OK;
  pthread_t t0, t1;

  // 1. Bind MPU device group 0 as MX3:Cascade to both model 0 and model 1.
  if (memx_status_no_error(status)) {
    status = memx_open(MODEL_0_ID, GROUP_ID, MEMX_DEVICE_CASCADE);
  }
  if (memx_status_no_error(status)) {
    status = memx_open(MODEL_1_ID, GROUP_ID, MEMX_DEVICE_CASCADE);
  }

  // 2. Run two models simultaneously using posix threads (Linux only)
  if (memx_status_no_error(status)) {
    if ((pthread_create(&t0, NULL, &run_inference_model_0, NULL) != 0)
      ||(pthread_create(&t1, NULL, &run_inference_model_1, NULL) != 0)) {
      status = MEMX_STATUS_OTHERS;
    }
  }
  if (memx_status_no_error(status)) {
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
  }

  // 3. Always remember to clean-up resources before leaving.
  memx_close(MODEL_0_ID);
  memx_close(MODEL_1_ID);

  // End of process
  if (memx_status_no_error(status)) {
    printf("success.\n");
  } else {
    printf("failure.\n");
  }
  return 0;
}
def main():
  '''Main process, create two threads to run inferences in parallel.'''
  err = 0

  # 1. Bind MPU device group 0 as MX3:Cascade to both model 0 and model 1.
  if not err:
    err = mxa.open(MODEL_0_ID, GROUP_ID, 3) # 3 = MX3:Cascade
  if not err:
    err = mxa.open(MODEL_1_ID, GROUP_ID, 3) # 3 = MX3:Cascade

  # 2. Run two models simultaneously using threads
  if not err:
    t0 = threading.Thread(target=run_inference_model_0, args=())
    t1 = threading.Thread(target=run_inference_model_1, args=())
    t0.start()
    t1.start()
    t0.join()
    t1.join()

  # 3. Always remember to clean-up resources before leaving.
  mxa.close(MODEL_0_ID)
  mxa.close(MODEL_1_ID)

  # End of process
  if not err:
    print("success.")
  else:
    print("failure.")

if __name__ == "__main__":
  main()

At the top level of the application, we put open() and close() outside of the model inference sub-routines, because we do not want a sub-routine to re-configure the interface every time it acquires the lock and starts to run inference. Here we create two threads to demonstrate how to run two models in parallel, using the lock mechanism to avoid hardware access conflicts.

Note

Remember that the number of open() and close() calls should be the same. And yes, you can put open() and close() inside the sub-routines if you want to, as sketched below.
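
For instance, a self-contained variant could look like the sketch below (our assumption, not part of the downloadable example), keeping exactly one close() for each open():

def run_inference_self_contained(config: RunInferenceConfig) -> int:
  '''Hypothetical variant: the sub-routine opens and closes its own model context.'''
  err = mxa.open(config.model_id, config.group_id, 3) # 3 = MX3:Cascade
  if err:
    return err # nothing to close if open() failed
  err = run_inference(config) # locked download + streaming as in 3.1
  mxa.close(config.model_id) # exactly one close() for the one open() above
  return err

The trade-off is that every call now re-initializes the interface, which is exactly the overhead the example avoids by keeping open() and close() in main().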

3.4 Execution Result#

The execution output of this example is the same as 2.4 Execution Result; only the file names are different.

Example 4: Runtime Device Selection#

4.1 Run Inference Sub-routine#

  // 1. Here we use 'trylock()' first to test whether the device is acquirable.
  // A return value of '0' means the lock was acquired successfully; otherwise
  // we move on to lock the other device.
  if (memx_trylock(GROUP_0_ID) == 0) {
    config->group_id = GROUP_0_ID;
  } else {
    memx_lock(GROUP_1_ID); // wait until lock is acquired
    config->group_id = GROUP_1_ID;
  }
  // 2. Re-configure the MPU device group bound to the model.
  memx_status status = memx_reconfigure(config->model_id, config->group_id);
  printf(" - Model %u is running on device %u\n",
    config->model_id, config->group_id);
  # 1. Here we use 'trylock()' first to test whether the device is acquirable.
  # A return value of '0' means the lock was acquired successfully; otherwise
  # we move on to lock the other device.
  if mxa.trylock(GROUP_0_ID) == 0:
    config.group_id = GROUP_0_ID
  else:
    mxa.lock(GROUP_1_ID) # wait until lock is acquired
    config.group_id = GROUP_1_ID
  # 2. Re-configure the MPU device group bound to the model.
  err = mxa.reconfigure(config.model_id, config.group_id)
  print(" - Model {} is running on device {}"
    .format(config.model_id, config.group_id))

Compared to 3.1 Run Inference Sub-routine, in this example we use trylock() before lock() to test whether device 0 is available. If device 0 is currently locked by the other model, we move on to device 1, and this time we wait until the lock is actually acquired.

After the lock is acquired, it is important to call reconfigure() to bind and set up the interface, just as open() does. Later, after inference, remember to unlock the device that was selected at runtime.
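
Putting the pieces together, the sketch below approximates the whole example-4 sub-routine (it is our reconstruction, not the verbatim downloadable source); note that the group released at the end is whichever one was actually locked:

def run_inference(config: RunInferenceConfig) -> int:
  '''Runtime device selection, then the same download/stream flow as 3.1.'''
  # 1. Try device 0 first; fall back to device 1 if it is busy.
  if mxa.trylock(GROUP_0_ID) == 0:
    config.group_id = GROUP_0_ID
  else:
    mxa.lock(GROUP_1_ID) # wait until lock is acquired
    config.group_id = GROUP_1_ID
  # 2. Bind the selected device group to this model.
  err = mxa.reconfigure(config.model_id, config.group_id)
  # 3. Download weight memory and model, then stream one frame (as in 3.1).
  if not err:
    err = mxa.download(config.model_id, config.dfp_path, 0,
      mxa.download_type_wtmem_and_model)
  if not err:
    err = mxa.set_stream_enable(config.model_id, 0)
  if not err:
    err = mxa.stream_ifmap(config.model_id,
      config.iport, config.ifmap, timeout=200)
  if not err:
    err = mxa.stream_ofmap(config.model_id,
      config.oport, config.ofmap, timeout=200)
  if not err:
    err = mxa.set_stream_disable(config.model_id, 1)
  # 4. Release whichever device group was locked at runtime.
  mxa.unlock(config.group_id)
  return err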

4.2 Run Inference By Model#

This part is basically the same as 3.2 Run Inference By Model.

4.3 Application Main#

  // 1. Bind MPU device group 0 as MX3:Cascade to both model 0 and model 1.
  // The group ID here can be any MPU device group during initialization, since
  // the runtime 'reconfigure()' will also initialize the interface, just like 'open()'.
  if (memx_status_no_error(status)) {
    status = memx_open(MODEL_0_ID, GROUP_0_ID, MEMX_DEVICE_CASCADE);
  }
  if (memx_status_no_error(status)) {
    status = memx_open(MODEL_1_ID, GROUP_0_ID, MEMX_DEVICE_CASCADE);
  }
  # 1. Bind MPU device group 0 as MX3:Cascade to both model 0 and model 1.
  # The group ID here can be any MPU device group during initialization, since
  # the runtime 'reconfigure()' will also initialize the interface, just like 'open()'.
  if not err:
    err = mxa.open(MODEL_0_ID, GROUP_0_ID, 3) # 3 = MX3:Cascade
  if not err:
    err = mxa.open(MODEL_1_ID, GROUP_0_ID, 3) # 3 = MX3:Cascade

It does not really matter which device group we bind to each model when we call open() for initialization, since we will later use reconfigure() to set up the interface anyway. As a result, here we use device 0 as the default interface for both models; everything else remains the same as 3.3 Application Main.

4.4 Execution Result#

$ mkdir build && cd build && cmake .. && make && cd ..
$ sudo ./build/memx_c_example4
 - Model 0 is running on device 0
 - Model 1 is running on device 1
 - Model 0 argmax = 92
 - Model 1 argmax = 284
success.
$ sudo python3 memx_py_example4.py
 - Model 0 is running on device 0
 - Model 1 is running on device 1
 - Model 0 argmax = 92
 - Model 1 argmax = 284
success.

Compared to 2.4 Execution Result, we see two additional messages telling us which model is running on which device. When the second model fails to lock device 0 with trylock(), it automatically moves on to device 1, which is exactly the behavior we expect.

Resources#