Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors¶
This Jupyter notebook can be launched online, opening an interactive environment in a browser window. It can also be installed locally. Select one of the following options.
This tutorial demonstrates how to improve the performance of sparse Transformer models with OpenVINO on 4th Gen Intel® Xeon® Scalable Processors.
The tutorial uses Optimum-Intel to download a BERT-base model that has been quantized, sparsified, and tuned for the SST2 dataset. It demonstrates the inference performance advantage on 4th Gen Intel® Xeon® Scalable Processors by running the model with Sparse Weight Decompression, a runtime option that leverages model sparsity for efficiency. The notebook consists of the following steps:
Install prerequisites.
Download and quantize a sparse public BERT model, using the OpenVINO integration with Hugging Face Optimum.
Compare sparse 8-bit vs. dense 8-bit inference performance.
Table of contents¶
Prerequisites¶
%pip install -q "openvino>=2023.1.0"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" datasets onnx "transformers>=4.33.0" --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Imports¶
import shutil
from pathlib import Path
from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
from huggingface_hub import hf_hub_download
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2024-02-09 23:02:05.779349: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 2024-02-09 23:02:05.814537: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-09 23:02:06.378496: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Download, quantize and sparsify the model, using the Hugging Face Optimum API¶
The first step is to download a quantized sparse transformer that has been converted to OpenVINO IR. Then, we run classification as a simple check that the downloaded model works. To find out how the model was quantized and sparsified, refer to the OpenVINO/bert-base-uncased-sst2-int8-unstructured80 model card on Hugging Face.
# The following model has been quantized and sparsified using Optimum-Intel 1.7, which is powered by OpenVINO and NNCF.
# For reproducibility, refer to https://huggingface.co/OpenVINO/bert-base-uncased-sst2-int8-unstructured80
model_id = "OpenVINO/bert-base-uncased-sst2-int8-unstructured80"
# The following two steps will set up the model and download it to the HF cache folder
ov_model = OVModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Let's take the model for a spin!
sentiment_classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)
text = "He's a dreadful magician."
outputs = sentiment_classifier(text)
print(outputs)
Compiling the model to CPU ...
device must be of type <class 'str'> but got <class 'torch.device'> instead
[{'label': 'negative', 'score': 0.9982142448425293}]
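As a quick complementary check, the same pipeline can be run on a positive sentence (the example text below is our own, not from the notebook):
# A clearly positive sentence should flip the predicted label
print(sentiment_classifier("She's a wonderful magician."))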
For benchmarking, we will use OpenVINO's benchmark application, so let's put the IRs into a single folder.
# create a folder
quantized_sparse_dir = Path("bert_80pc_sparse_quantized_ir")
quantized_sparse_dir.mkdir(parents=True, exist_ok=True)
# hf_hub_download returns the path to the specified filename in the HF cache folder (populated by the download above)
ov_ir_xml_path = hf_hub_download(repo_id=model_id, filename="openvino_model.xml")
ov_ir_bin_path = hf_hub_download(repo_id=model_id, filename="openvino_model.bin")
# copy IRs to the folder
shutil.copy(ov_ir_xml_path, quantized_sparse_dir)
shutil.copy(ov_ir_bin_path, quantized_sparse_dir)
'bert_80pc_sparse_quantized_ir/openvino_model.bin'
Benchmark quantized dense inference performance¶
Benchmark dense inference performance using parallel execution on four CPU cores to simulate a small instance in cloud infrastructure. Sequence length depends on the use case: 16 is common for conversational AI, while 160 is common for question answering tasks. We set it to 64 as an example; it is recommended to tune it for your applications.
# Dump benchmarking config for dense inference
with (quantized_sparse_dir / "perf_config.json").open("w") as outfile:
outfile.write(
"""
{
"CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
}
"""
)
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 62.38 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64]
[ INFO ] Reshape model took 23.14 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1107.64 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch_jit
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] NUM_STREAMS: 4
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] INFERENCE_NUM_THREADS: 4
[ INFO ] PERF_COUNT: NO
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] PERFORMANCE_HINT: THROUGHPUT
[ INFO ] EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] ENABLE_CPU_PINNING: True
[ INFO ] SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ] ENABLE_HYPER_THREADING: True
[ INFO ] EXECUTION_DEVICES: ['CPU']
[ INFO ] CPU_DENORMALS_OPTIMIZATION: False
[ INFO ] CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values!
[ INFO ] Fill input 'input_ids' with random values
[ INFO ] Fill input 'attention_mask' with random values
[ INFO ] Fill input 'token_type_ids' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 30.14 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 8852 iterations
[ INFO ] Duration: 60038.32 ms
[ INFO ] Latency:
[ INFO ] Median: 26.79 ms
[ INFO ] Average: 26.86 ms
[ INFO ] Min: 24.76 ms
[ INFO ] Max: 42.20 ms
[ INFO ] Throughput: 147.44 FPS
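The same four-stream, four-thread throughput measurement can also be scripted directly with the OpenVINO Runtime Python API. The sketch below is not part of the original notebook; the iteration count and the random token IDs (drawn from BERT-base-uncased's 30522-token vocabulary) are illustrative choices.
import time

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model(str(quantized_sparse_dir / "openvino_model.xml"))
# Fix the dynamic shapes to [1, 64], matching the benchmark_app command above
model.reshape({"input_ids": [1, 64], "attention_mask": [1, 64], "token_type_ids": [1, 64]})
compiled_model = core.compile_model(model, "CPU", {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4})

# Random inputs, analogous to the randomly filled tensors of benchmark_app
example_inputs = {
    "input_ids": np.random.randint(0, 30522, (1, 64), dtype=np.int64),
    "attention_mask": np.ones((1, 64), dtype=np.int64),
    "token_type_ids": np.zeros((1, 64), dtype=np.int64),
}

# AsyncInferQueue defaults to the optimal number of infer requests (4 here)
infer_queue = ov.AsyncInferQueue(compiled_model)
num_iterations = 1000  # illustrative; benchmark_app runs for a fixed duration instead
start = time.perf_counter()
for _ in range(num_iterations):
    infer_queue.start_async(example_inputs)
infer_queue.wait_all()
print(f"Throughput: {num_iterations / (time.perf_counter() - start):.2f} FPS")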
Benchmark quantized sparse inference performance¶
To enable the sparse weight decompression feature, add it to the runtime configuration as shown below. CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE takes values between 0.5 and 1.0; it is the layer-level sparsity threshold at which the optimization is enabled for a layer.
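When compiling the model through the Python API instead of benchmark_app, the same property can be passed in the configuration dictionary. A minimal sketch, not part of the original notebook, assuming the IR folder created earlier:
import openvino as ov

core = ov.Core()
model = core.read_model(str(quantized_sparse_dir / "openvino_model.xml"))
# Enable sparse weight decompression for layers that are at least 75% sparse
compiled_model = core.compile_model(
    model,
    "CPU",
    {"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.75},
)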
# Dump benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" controls minimum sparsity rate for weights to consider
# for sparse optimization at the runtime.
with (quantized_sparse_dir / "perf_config_sparse.json").open("w") as outfile:
outfile.write(
"""
{
"CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.75}
}
"""
)
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config_sparse.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 71.12 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64]
[ INFO ] Reshape model took 23.54 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device
[ ERROR ] Exception from src/inference/src/core.cpp:99:
[ GENERAL_ERROR ] Exception from src/plugins/intel_cpu/src/config.cpp:158:
Wrong value for property key CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE. Expected only float numbers
Traceback (most recent call last):
File "/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/tools/benchmark/main.py", line 408, in main
compiled_model = benchmark.core.compile_model(model, benchmark.device, device_config)
File "/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/runtime/ie_api.py", line 547, in compile_model
super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/core.cpp:99:
[ GENERAL_ERROR ] Exception from src/plugins/intel_cpu/src/config.cpp:158:
Wrong value for property key CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE. Expected only float numbers