Remote Tensor API of the GPU Plugin#

The GPU plugin implementation of the ov::RemoteContext and ov::RemoteTensor interfaces supports GPU pipeline developers who need video memory sharing and interoperability with existing native APIs such as OpenCL*, Microsoft DirectX*, and VAAPI.

These interfaces let you avoid memory copy overhead when plugging OpenVINO™ inference into an existing GPU pipeline. They also allow OpenCL* kernels to participate in the pipeline as native buffer consumers or producers of OpenVINO™ inference.

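The Remote Tensor API supports two interoperability scenarios:

The GPU plugin context and memory objects can be constructed from low-level device, display, or memory handles and then used to create OpenVINO™ ov::CompiledModel or ov::Tensor objects.
The OpenCL* context or buffer handles can be obtained from existing GPU plugin objects and used in OpenCL* processing on the application side.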

The class and function declarations for the API are defined in the following files:

Windows* – openvino/runtime/intel_gpu/ocl/ocl.hpp and openvino/runtime/intel_gpu/ocl/dx.hpp
Linux* – openvino/runtime/intel_gpu/ocl/ocl.hpp and openvino/runtime/intel_gpu/ocl/va.hpp

The most common way to let your application interact with the Remote Tensor API is to use the user-side utility classes and functions that consume or produce native handles directly.

Context Sharing Between Application and GPU Plugin#

The GPU plugin classes that implement the ov::RemoteContext interface are responsible for context sharing. Obtaining a context object is the first step in sharing pipeline objects. The GPU plugin context object directly wraps an OpenCL* context and sets the scope for sharing ov::CompiledModel and ov::RemoteTensor objects. An ov::RemoteContext object can either be created on top of an existing handle from a native API or retrieved from the GPU plugin.

Once you have a context, you can use it to compile a new ov::CompiledModel or to create ov::RemoteTensor objects. To compile a model, use the dedicated overload of ov::Core::compile_model() that accepts the context as an additional parameter.

Creating RemoteContext from Native Handle#

To create an ov::RemoteContext object for a user context, explicitly provide the context to the plugin by using the constructor of one of the ov::RemoteContext derived classes.

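Getting RemoteContext from the Plugin#

If you do not provide a user context, the plugin uses its default internal context. The plugin keeps using the same internal context object as long as the plugin options stay the same, so all ov::CompiledModel objects created during that time share the same context. Once the plugin options change, the internal context is replaced by a new one.

To request the current default context of the plugin, use one of the following methods: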

C++

Get context from Core

auto gpu_context =
    core.get_default_context("GPU").as<ov::intel_gpu::ocl::ClContext>();
// Extract the OpenCL context handle from the RemoteContext
cl_context context_handle = gpu_context.get();

Get context from compiled model

auto gpu_context =
    compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
// Extract the OpenCL context handle from the RemoteContext
cl_context context_handle = gpu_context.get();

C

Get context from Core

ov_core_get_default_context(core, "GPU", &gpu_context);
// Extract the OpenCL context handle from the RemoteContext
size_t size = 0;
char* params = NULL;
// params has the following format: "CONTEXT_TYPE OCL OCL_CONTEXT 0x5583b2ec7b40 OCL_QUEUE 0x5583b2e98ff0"
// The caller has to parse it.
ov_remote_context_get_params(gpu_context, &size, &params);

Get context from compiled model

ov_compiled_model_get_context(compiled_model, &gpu_context);
// Extract the OpenCL context handle from the RemoteContext
size_t size = 0;
char* params = NULL;
// params has the following format: "CONTEXT_TYPE OCL OCL_CONTEXT 0x5583b2ec7b40 OCL_QUEUE 0x5583b2e98ff0"
// The caller has to parse it.
ov_remote_context_get_params(gpu_context, &size, &params);
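Both C snippets leave parsing of the params string to the caller. The following is a minimal C++ sketch of one possible approach, assuming the string keeps the flat, space-separated KEY VALUE layout of the sample shown in the comments. The helper names parse_context_params and handle_from_params are hypothetical and are not part of OpenVINO.

```cpp
#include <cstdint>
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Splits a whitespace-separated "KEY VALUE KEY VALUE ..." string into a map.
// Assumption: keys and values never contain spaces, as in the sample string.
std::map<std::string, std::string> parse_context_params(const std::string& params) {
    std::map<std::string, std::string> result;
    std::istringstream stream(params);
    std::string key, value;
    while (stream >> key >> value) {
        result[key] = value;
    }
    return result;
}

// Converts a hexadecimal handle value such as "0x5583b2ec7b40" into a raw pointer.
// Returns nullptr when the key is missing or the value is not a valid hex number.
void* handle_from_params(const std::map<std::string, std::string>& params,
                         const std::string& key) {
    auto it = params.find(key);
    if (it == params.end()) {
        return nullptr;
    }
    try {
        // Base 16 accepts an optional "0x" prefix.
        const auto raw = static_cast<std::uintptr_t>(std::stoull(it->second, nullptr, 16));
        return reinterpret_cast<void*>(raw);
    } catch (const std::exception&) {
        return nullptr;
    }
}
```

With the sample string, handle_from_params(parsed, "OCL_CONTEXT") yields a pointer that the application could cast to cl_context, and the "OCL_QUEUE" entry could be cast to cl_command_queue in the same way.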

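Context and Queue Sharing#

The GPU plugin supports creating a shared context from a cl_command_queue handle. In that case, the OpenCL* context handle is extracted from the given queue via the OpenCL* API, and the queue itself is used inside the plugin for further execution of inference primitives. Sharing the queue changes the behavior of the ov::InferRequest::start_async() method: it guarantees that submission of inference primitives to the given queue has finished before control returns to the calling thread.

This sharing mechanism lets you perform pipeline synchronization on the application side and avoid blocking the host thread while waiting for inference to complete. The pseudo-code may look as follows:

Queue and context sharing example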

// ...
// Initialize the core and read the model
ov::Core core;
auto model = core.read_model("model.xml");

// Get the OpenCL queue and context objects
cl::CommandQueue queue = get_ocl_queue();
cl::Context cl_context = get_ocl_context();

// Share the queue with the GPU plugin and compile the model
auto remote_context = ov::intel_gpu::ocl::ClContext(core, queue.get());
auto exec_net_shared = core.compile_model(model, remote_context);

auto input = model->get_parameters().at(0);
auto input_size = ov::shape_size(input->get_shape());
auto output = model->get_results().at(0);
auto output_size = ov::shape_size(output->get_shape());
cl_int err;

// Create OpenCL buffers within the context.
// shape_size() returns an element count, so scale it by the element size to get bytes.
cl::Buffer shared_in_buffer(cl_context, CL_MEM_READ_WRITE, input_size * input->get_element_type().size(), NULL, &err);
cl::Buffer shared_out_buffer(cl_context, CL_MEM_READ_WRITE, output_size * output->get_element_type().size(), NULL, &err);
// Wrap the buffers into RemoteTensors and set them on the infer request
auto shared_in_blob = remote_context.create_tensor(input->get_element_type(), input->get_shape(), shared_in_buffer);
auto shared_out_blob = remote_context.create_tensor(output->get_element_type(), output->get_shape(), shared_out_buffer);
auto infer_request = exec_net_shared.create_infer_request();
infer_request.set_tensor(input, shared_in_blob);
infer_request.set_tensor(output, shared_out_blob);

// ...
// Run the user pre-processing kernel
cl::Program program;
cl::Kernel kernel_preproc(program, "user_kernel_preproc");
kernel_preproc.setArg(0, shared_in_buffer);
queue.enqueueNDRangeKernel(kernel_preproc,
                           cl::NDRange(0),
                           cl::NDRange(input_size),
                           cl::NDRange(1),
                           nullptr,
                           nullptr);
// A blocking clFinish() call is not required, but this barrier is added to the queue
// to guarantee that the user kernel finishes before any inference primitive starts
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
// ...
// Pass the results to inference.
// The remote context was created with queue sharing, so start_async() guarantees
// that scheduling has finished when it returns
infer_request.start_async();

// Run a post-processing kernel.
// infer_request.wait() is not called; synchronization between inference and
// post-processing is done via the enqueueBarrierWithWaitList call.
cl::Kernel kernel_postproc(program, "user_kernel_postproc");
kernel_postproc.setArg(0, shared_out_buffer);
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
queue.enqueueNDRangeKernel(kernel_postproc,
                           cl::NDRange(0),
                           cl::NDRange(output_size),
                           cl::NDRange(1),
                           nullptr,
                           nullptr);

// Wait for the pipeline to complete
queue.finish();

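Limitations#

Some primitives in the GPU plugin may block the host thread while waiting for previous primitives before adding their kernels to the command queue. In such cases, the ov::InferRequest::start_async() call takes longer to return control to the calling thread, because it internally waits for partial or full completion of the network. Examples of such operations: Loop, TensorIterator, DetectionOutput, NonMaxSuppression.
Synchronization of pre- and post-processing jobs with the inference pipeline inside the shared queue is the user's responsibility.
Throughput mode is not available when queue sharing is used; that is, only a single stream can be used for each compiled model.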

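Low-Level Methods for RemoteContext and RemoteTensor Creation#

The high-level wrappers described above bring a direct dependency on native APIs into the user program. If you want to avoid that dependency, you can still use the ov::Core::create_context(), ov::RemoteContext::create_tensor(), and ov::RemoteContext::get_params() methods directly. At this level, native handles are reinterpreted as void pointers, and all arguments are passed in ov::AnyMap containers filled with pairs of std::string and ov::Any. Two types of map entries are possible: descriptor and container. A descriptor sets the expected structure and possible parameter values of the map.

For the available low-level properties and their descriptions, refer to the header file (remote_properties.hpp).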
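To illustrate the idea of passing a native handle as a void pointer inside a string-keyed, type-erased map, here is a standard-library-only sketch. It deliberately does not use OpenVINO: ParamMap is a stand-in built on std::any (C++17), not ov::AnyMap, and make_ocl_context_params and native_handle are hypothetical helpers. The key names are taken from the sample parameter string shown earlier on this page; the authoritative property names are those in remote_properties.hpp.

```cpp
#include <any>
#include <map>
#include <string>

// Stand-in for a string-keyed map with type-erased values (NOT the OpenVINO type).
using ParamMap = std::map<std::string, std::any>;

// Builds a parameter map describing an OpenCL context.
// The native handle is stored as a plain void pointer, so this code needs no OpenCL headers.
ParamMap make_ocl_context_params(void* native_context) {
    return {
        {"CONTEXT_TYPE", std::string("OCL")},  // descriptor-like entry: selects the map structure
        {"OCL_CONTEXT", native_context},       // handle entry: the reinterpreted native handle
    };
}

// Retrieves a handle stored under the given key.
// Returns nullptr if the key is absent or the stored value is not a void pointer.
void* native_handle(const ParamMap& params, const std::string& key) {
    auto it = params.find(key);
    if (it == params.end()) {
        return nullptr;
    }
    const auto* handle = std::any_cast<void*>(&it->second);
    return handle ? *handle : nullptr;
}
```

The point of the design is that the application only ever sees void pointers and strings at this level, which is what lets it avoid including native API headers.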

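Examples#

Refer to the sections below for pseudo-code of usage examples.

Note

For low-level parameter usage examples, see the source code of the user-side wrappers in the include files mentioned above.

Direct consumption of an NV12 VAAPI video decoder surface on Linux*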

C++

// ...

using namespace ov::preprocess;
auto p = PrePostProcessor(model);
p.input().tensor().set_element_type(ov::element::u8)
                  .set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES, {"y", "uv"})
                  .set_memory_type(ov::intel_gpu::memory_type::surface);
p.input().preprocess().convert_color(ov::preprocess::ColorFormat::BGR);
p.input().model().set_layout("NCHW");
model = p.build();

VADisplay disp = get_va_display();
// Create the shared context object
auto shared_va_context = ov::intel_gpu::ocl::VAContext(core, disp);
// Compile the model within the shared context
auto compiled_model = core.compile_model(model, shared_va_context);

auto input0 = model->get_parameters().at(0);
auto input1 = model->get_parameters().at(1);

// NV12 plane tensors use the NHWC layout: {N, H, W, C}
auto shape = input0->get_shape();
auto height = shape[1];
auto width = shape[2];

// Run decoding and obtain the decoded surface handle
VASurfaceID va_surface = decode_va_surface();
// ...
// Wrap the decoder output into RemoteTensors and set them as inference inputs
auto nv12_blob = shared_va_context.create_tensor_nv12(height, width, va_surface);

auto infer_request = compiled_model.create_infer_request();
infer_request.set_tensor(input0->get_friendly_name(), nv12_blob.first);
infer_request.set_tensor(input1->get_friendly_name(), nv12_blob.second);
infer_request.start_async();
infer_request.wait();
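The example above feeds the model through two inputs because an NV12 frame consists of two planes: a full-resolution luma (Y) plane and a half-resolution plane of interleaved chroma (UV) samples. As a rough illustration, the sketch below computes the byte size of each plane. Nv12Layout and nv12_layout are hypothetical names, and the calculation assumes even dimensions and tightly packed rows; real decoder surfaces usually add row padding (pitch), which this sketch ignores.

```cpp
#include <cstddef>
#include <stdexcept>

// Byte sizes of the two planes of a tightly packed NV12 frame.
struct Nv12Layout {
    std::size_t y_bytes;      // luma plane: one byte per pixel
    std::size_t uv_bytes;     // chroma plane: interleaved U and V, subsampled 2x2
    std::size_t total_bytes;  // sum of both planes
};

// Computes plane sizes for a frame of the given dimensions in pixels.
// Throws std::invalid_argument for odd dimensions, which 4:2:0 subsampling cannot split evenly.
Nv12Layout nv12_layout(std::size_t width, std::size_t height) {
    if (width % 2 != 0 || height % 2 != 0) {
        throw std::invalid_argument("NV12 requires even width and height");
    }
    const std::size_t y = width * height;
    // One U byte and one V byte for every 2x2 block of pixels.
    const std::size_t uv = (width / 2) * (height / 2) * 2;
    return {y, uv, y + uv};
}
```

For a 640x480 frame this gives 307200 bytes of luma and 153600 bytes of chroma, so the whole frame takes 1.5 bytes per pixel.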

C

// ...

ov_preprocess_prepostprocessor_create(model, &preprocess); 
ov_preprocess_prepostprocessor_get_input_info(preprocess, &preprocess_input_info); 
ov_preprocess_input_info_get_tensor_info(preprocess_input_info, &preprocess_input_tensor_info); 

ov_preprocess_input_tensor_info_set_element_type(preprocess_input_tensor_info, U8); 
ov_preprocess_input_tensor_info_set_color_format_with_subname(preprocess_input_tensor_info, NV12_TWO_PLANES, 2, "y", "uv"); 

ov_preprocess_input_tensor_info_set_memory_type(preprocess_input_tensor_info, "GPU_SURFACE"); 

ov_preprocess_input_tensor_info_set_spatial_static_shape(preprocess_input_tensor_info, height, width); 
ov_preprocess_input_info_get_preprocess_steps(preprocess_input_info, &preprocess_input_steps); 
ov_preprocess_preprocess_steps_convert_color(preprocess_input_steps, BGR); 
ov_preprocess_preprocess_steps_resize(preprocess_input_steps, RESIZE_LINEAR); 
ov_preprocess_input_info_get_model_info(preprocess_input_info, &preprocess_input_model_info); 
ov_layout_create("NCHW", &layout); 
ov_preprocess_input_model_info_set_layout(preprocess_input_model_info, layout); 
ov_preprocess_prepostprocessor_build(preprocess, &new_model); 

VADisplay display = get_va_display(); 
// Create the shared context object
ov_core_create_context(core, 
                      "GPU", 
                      4, 
                      &shared_va_context, 
                      ov_property_key_intel_gpu_context_type, 
                      "VA_SHARED", 
                      ov_property_key_intel_gpu_va_device, 
                      display); 

// Compile the model within the shared context
ov_core_compile_model_with_context(core, new_model, shared_va_context, 0, &compiled_model); 

ov_output_const_port_t* port_0 = NULL; 
char* input_name_0 = NULL; 
ov_model_const_input_by_index(new_model, 0, &port_0); 
ov_port_get_any_name(port_0, &input_name_0); 

ov_output_const_port_t* port_1 = NULL; 
char* input_name_1 = NULL; 
ov_model_const_input_by_index(new_model, 1, &port_1); 
ov_port_get_any_name(port_1, &input_name_1); 

ov_shape_t shape_y = {0, NULL}; 
ov_shape_t shape_uv = {0, NULL}; 
ov_const_port_get_shape(port_0, &shape_y); 
ov_const_port_get_shape(port_1, &shape_uv); 

// Run decoding and obtain the decoded surface handle
VASurfaceID va_surface = decode_va_surface();
// ...
// Wrap the decoder output into remote tensors and set them as inference inputs

ov_tensor_t* remote_tensor_y = NULL; 
ov_tensor_t* remote_tensor_uv = NULL; 
ov_remote_context_create_tensor(shared_va_context, 
                                U8, 
                                shape_y, 
                                6, 
                                &remote_tensor_y, 
                                ov_property_key_intel_gpu_shared_mem_type, 
                                "VA_SURFACE", 
                                ov_property_key_intel_gpu_dev_object_handle, 
                                va_surface, 
                                ov_property_key_intel_gpu_va_plane, 
                                0); 

ov_remote_context_create_tensor(shared_va_context, 
                                U8, 
                                shape_uv, 
                                6, 
                                &remote_tensor_uv, 
                                ov_property_key_intel_gpu_shared_mem_type, 
                                "VA_SURFACE", 
                                ov_property_key_intel_gpu_dev_object_handle, 
                                va_surface, 
                                ov_property_key_intel_gpu_va_plane, 
                                1); 

ov_compiled_model_create_infer_request(compiled_model, &infer_request); 
ov_infer_request_set_tensor(infer_request, input_name_0, remote_tensor_y); 
ov_infer_request_set_tensor(infer_request, input_name_1, remote_tensor_uv); 
ov_infer_request_infer(infer_request);
