Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors

This Jupyter notebook can be launched online, opening an interactive environment in a browser window. You can also install it locally. Choose one of the following options:

Google Colab GitHub

This tutorial demonstrates how to improve the performance of sparse Transformer models with OpenVINO on 4th Gen Intel® Xeon® Scalable Processors.

The tutorial uses Optimum-Intel to download a BERT-base model that has been quantized, sparsified, and tuned for the SST-2 dataset. It demonstrates the inference performance advantage on 4th Gen Intel® Xeon® Scalable Processors by running the model with Sparse Weight Decompression, a runtime option that exploits model sparsity for efficiency. The notebook consists of the following steps:

  • Install prerequisites.

  • Download and quantize a sparse public BERT model, using the OpenVINO integration with Hugging Face Optimum.

  • Compare sparse 8-bit vs. dense 8-bit inference performance.


Prerequisites

%pip install -q "openvino>=2023.1.0"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" datasets onnx transformers>=4.33.0 --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

Imports

import shutil
from pathlib import Path

from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
from huggingface_hub import hf_hub_download
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2024-02-09 23:02:05.779349: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-02-09 23:02:05.814537: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-09 23:02:06.378496: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Download, quantize, and sparsify the model using the Hugging Face Optimum API

The first step is to download a quantized sparse transformer that has already been converted into OpenVINO IR. The downloaded model is then put through a classification task as a simple check that it works. To find out how the model was quantized and sparsified, refer to the OpenVINO/bert-base-uncased-sst2-int8-unstructured80 model card on Hugging Face.

# The following model has been quantized and sparsified using Optimum-Intel 1.7, which is powered by OpenVINO and NNCF.
# For reproducibility, refer to https://huggingface.co/OpenVINO/bert-base-uncased-sst2-int8-unstructured80
model_id = "OpenVINO/bert-base-uncased-sst2-int8-unstructured80"

# The following two steps set up the model and tokenizer and download them to the HF cache folder
ov_model = OVModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Let's take the model for a spin!
sentiment_classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)

text = "He's a dreadful magician."
outputs = sentiment_classifier(text)

print(outputs)
Compiling the model to CPU ...
device must be of type <class 'str'> but got <class 'torch.device'> instead
[{'label': 'negative', 'score': 0.9982142448425293}]

For benchmarking, we use OpenVINO's benchmark application and place the IR files in a single folder.

# create a folder
quantized_sparse_dir = Path("bert_80pc_sparse_quantized_ir")
quantized_sparse_dir.mkdir(parents=True, exist_ok=True)

# hf_hub_download returns the path to the specified filename in the cache folder (populated by from_pretrained above)
ov_ir_xml_path = hf_hub_download(repo_id=model_id, filename="openvino_model.xml")
ov_ir_bin_path = hf_hub_download(repo_id=model_id, filename="openvino_model.bin")

# copy IRs to the folder
shutil.copy(ov_ir_xml_path, quantized_sparse_dir)
shutil.copy(ov_ir_bin_path, quantized_sparse_dir)
'bert_80pc_sparse_quantized_ir/openvino_model.bin'

Benchmark quantized dense inference performance

Benchmark dense inference performance using parallel execution on four CPU cores to simulate a small instance in cloud infrastructure. Sequence length depends on the use case: 16 is common for conversational AI, while 160 is typical for question answering tasks. It is set to 64 here as an example; tuning it to your application is recommended.

# Dump benchmarking config for dense inference
with (quantized_sparse_dir / "perf_config.json").open("w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
        }
        """
    )
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 62.38 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64]
[ INFO ] Reshape model took 23.14 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs:
[ INFO ]     logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1107.64 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: torch_jit
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   NUM_STREAMS: 4
[ INFO ]   AFFINITY: Affinity.CORE
[ INFO ]   INFERENCE_NUM_THREADS: 4
[ INFO ]   PERF_COUNT: NO
[ INFO ]   INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]   PERFORMANCE_HINT: THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   ENABLE_CPU_PINNING: True
[ INFO ]   SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]   ENABLE_HYPER_THREADING: True
[ INFO ]   EXECUTION_DEVICES: ['CPU']
[ INFO ]   CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]   CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values!
[ INFO ] Fill input 'input_ids' with random values
[ INFO ] Fill input 'attention_mask' with random values
[ INFO ] Fill input 'token_type_ids' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 30.14 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            8852 iterations
[ INFO ] Duration:         60038.32 ms
[ INFO ] Latency:
[ INFO ]    Median:        26.79 ms
[ INFO ]    Average:       26.86 ms
[ INFO ]    Min:           24.76 ms
[ INFO ]    Max:           42.20 ms
[ INFO ] Throughput:   147.44 FPS

Benchmark quantized sparse inference performance

To enable the sparse weight decompression feature, add it to the runtime configuration as shown below. CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE takes values between 0.5 and 1.0; it is a layer-level sparsity threshold above which sparse optimization is enabled for a layer.

# Dump benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" controls minimum sparsity rate for weights to consider
# for sparse optimization at the runtime.
with (quantized_sparse_dir / "perf_config_sparse.json").open("w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.75}
        }
        """
    )
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config_sparse.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 71.12 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64]
[ INFO ] Reshape model took 23.54 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs:
[ INFO ]     logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device
[ ERROR ] Exception from src/inference/src/core.cpp:99:
[ GENERAL_ERROR ] Exception from src/plugins/intel_cpu/src/config.cpp:158:
Wrong value for property key CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE. Expected only float numbers

Traceback (most recent call last):
  File "/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/tools/benchmark/main.py", line 408, in main
    compiled_model = benchmark.core.compile_model(model, benchmark.device, device_config)
  File "/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/runtime/ie_api.py", line 547, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/core.cpp:99:
[ GENERAL_ERROR ] Exception from src/plugins/intel_cpu/src/config.cpp:158:
Wrong value for property key CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE. Expected only float numbers
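As an illustration only (this cell is not part of the original notebook), the same runtime option can also be passed directly from Python instead of through a benchmark_app config file, for example via the ov_config argument of Optimum-Intel. This is a minimal sketch: it reuses model_id and tokenizer from above, passes the property as a string key/value, and only takes effect on CPUs that support sparse weight decompression, such as 4th Gen Intel® Xeon® Scalable Processors.

# Sketch only: pass the sparse decompression rate through Optimum-Intel's ov_config.
# The value mirrors the perf_config_sparse.json entry above; on CPUs without
# sparse weight decompression support the option has no effect.
sparse_ov_model = OVModelForSequenceClassification.from_pretrained(
    model_id,
    ov_config={"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.75"},
)
sparse_classifier = pipeline("text-classification", model=sparse_ov_model, tokenizer=tokenizer)
print(sparse_classifier("He's a dreadful magician."))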

When this might be helpful

This feature can improve inference performance for models with sparse weights in scenarios where the model is deployed to handle multiple requests asynchronously in parallel. It is especially effective with small sequence lengths, for example 32 and lower.
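The sketch below (not from the original notebook) shows one way to serve such parallel requests from Python with an AsyncInferQueue. It assumes the IR copied to quantized_sparse_dir and the tokenizer loaded earlier, and it sets CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE via the plain string key, which is only honored on supported hardware.

# A sketch of asynchronous, parallel inference with OpenVINO's AsyncInferQueue.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model(quantized_sparse_dir / "openvino_model.xml")

# Throughput-oriented compilation; the sparse decompression rate mirrors perf_config_sparse.json.
compiled_model = core.compile_model(
    model,
    "CPU",
    {"PERFORMANCE_HINT": "THROUGHPUT", "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.75"},
)

# Four infer requests processed in parallel; the callback stores the predicted class per request.
infer_queue = ov.AsyncInferQueue(compiled_model, 4)
predictions = {}

def completion_callback(request, userdata):
    predictions[userdata] = int(np.argmax(request.get_output_tensor(0).data, axis=-1)[0])

infer_queue.set_callback(completion_callback)

texts = ["He's a dreadful magician.", "What a wonderful day!", "The plot was predictable."]
for i, text in enumerate(texts):
    encoded = tokenizer(text, padding="max_length", max_length=64, truncation=True, return_tensors="np")
    infer_queue.start_async(dict(encoded), userdata=i)

infer_queue.wait_all()
print(predictions)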

For more details about asynchronous inference with OpenVINO, refer to the following documentation: