Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors#

This Jupyter notebook can be launched online, which opens an interactive environment in a browser window, or installed and run locally. Launch options: Google Colab, GitHub.

This tutorial demonstrates how to improve the performance of sparse transformer models with OpenVINO on 4th Gen Intel® Xeon® Scalable Processors.

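The tutorial uses Optimum-Intel to download a BERT-base model that has been quantized, sparsified, and fine-tuned for the SST2 dataset. It demonstrates the inference performance benefit on 4th Gen Intel® Xeon® Scalable Processors by running the model with sparse weight decompression, a runtime option that exploits model sparsity for efficiency. The notebook consists of the following steps: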

  • Install the prerequisites.

  • Download a sparse, quantized public BERT model using the OpenVINO integration with Hugging Face Optimum.

  • Compare sparse 8-bit and dense 8-bit inference performance.

Prerequisites#

%pip install -q "openvino>=2023.1.0" 
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" "torch>=2.1" datasets onnx "transformers>=4.33.0" --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.

Imports#

import shutil 
from pathlib import Path 

from optimum.intel.openvino import OVModelForSequenceClassification 
from transformers import AutoTokenizer, pipeline 
from huggingface_hub import hf_hub_download

Download, quantize, and sparsify the model using the Hugging Face Optimum API#

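The first step is to download a quantized sparse transformer that has already been converted to OpenVINO IR. A single classification is then run as a quick sanity check that the downloaded model works. To find out how the model was quantized and sparsified, see the OpenVINO/bert-base-uncased-sst2-int8-unstructured80 model card on Hugging Face.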

# The following model has been quantized and sparsified using Optimum-Intel 1.7, which is enabled by OpenVINO and NNCF.
# For reproducibility, see https://huggingface.co/OpenVINO/bert-base-uncased-sst2-int8-unstructured80
model_id = "OpenVINO/bert-base-uncased-sst2-int8-unstructured80" 

# The following two steps set up the model and download it to the HF cache folder
ov_model = OVModelForSequenceClassification.from_pretrained(model_id) 
tokenizer = AutoTokenizer.from_pretrained(model_id) 

# Take the model for a spin
sentiment_classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer) 

text = "He's a dreadful magician." 
outputs = sentiment_classifier(text) 

print(outputs)
Compiling the model to CPU ...
[{'label': 'negative', 'score': 0.9982142448425293}]

Benchmarking uses OpenVINO's benchmark application, so the IR files are first placed in a single folder.

# Create the folder
quantized_sparse_dir = Path("bert_80pc_sparse_quantized_ir") 
quantized_sparse_dir.mkdir(parents=True, exist_ok=True) 

# hf_hub_download returns the path to the given file in the cache folder
ov_ir_xml_path = hf_hub_download(repo_id=model_id, filename="openvino_model.xml") 
ov_ir_bin_path = hf_hub_download(repo_id=model_id, filename="openvino_model.bin") 

# Copy the IR files into the folder
shutil.copy(ov_ir_xml_path, quantized_sparse_dir) 
shutil.copy(ov_ir_bin_path, quantized_sparse_dir)
'bert_80pc_sparse_quantized_ir/openvino_model.bin'
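
An OpenVINO IR consists of an .xml file and a .bin file that must sit side by side, so it can be worth confirming that both copies succeeded before benchmarking. The following is a minimal sketch under that assumption; `missing_ir_files` is a hypothetical helper introduced here for illustration and is not part of OpenVINO or Optimum-Intel.

```python
from pathlib import Path


def missing_ir_files(ir_dir, stem="openvino_model"):
    """Return the IR file names (.xml and .bin) that are absent from ir_dir.

    Hypothetical helper: `stem` defaults to the file name used in this notebook.
    """
    ir_dir = Path(ir_dir)
    expected = [f"{stem}.xml", f"{stem}.bin"]
    return [name for name in expected if not (ir_dir / name).is_file()]


# Folder name taken from the cell above.
missing = missing_ir_files("bert_80pc_sparse_quantized_ir")
print("IR complete" if not missing else f"Missing IR files: {missing}")
```

An empty list means both files are present and `benchmark_app` can be pointed at the .xml file.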

Benchmark quantized dense inference performance#

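Dense inference performance is benchmarked using parallel execution on four CPU cores, which simulates a small instance in a cloud infrastructure. The sequence length depends on the use case: 16 is common for conversational AI, while 160 is common for question answering tasks. It is set to 64 here as an example; tuning it to match your application is recommended.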
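
Because the sequence length is the main value to adjust, it can help to generate the `-shape` argument used in this section's `benchmark_app` command instead of typing it by hand. The sketch below is illustrative only: `build_shape_arg` is a hypothetical helper, not part of OpenVINO, and it assumes the three input names of this model (`input_ids`, `attention_mask`, `token_type_ids`).

```python
def build_shape_arg(seq_len, batch_size=1,
                    input_names=("input_ids", "attention_mask", "token_type_ids")):
    """Build the value passed to benchmark_app's -shape option.

    Hypothetical helper: every input gets the same [batch_size, seq_len] shape.
    """
    if seq_len < 1 or batch_size < 1:
        raise ValueError("seq_len and batch_size must be positive integers")
    return ",".join(f"{name}[{batch_size},{seq_len}]" for name in input_names)


# Sequence lengths mentioned in the text: conversational AI, the example, question answering.
for seq_len in (16, 64, 160):
    print(seq_len, build_shape_arg(seq_len))
```

For a sequence length of 64 this yields the same string that appears in the benchmark commands of this notebook.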

# Dump the benchmarking config for dense inference
with (quantized_sparse_dir / "perf_config.json").open("w") as outfile: 
    outfile.write( 
        """ 
        { 
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4} 
        } 
        """ 
    )
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments 
[ INFO ] Parsing input parameters 
[Step 2/11] Loading OpenVINO Runtime 
[ INFO ] OpenVINO: 
[ INFO ] Build .................................2024.4.0-16028-fe423b97163 
[ INFO ] 
[ INFO ] Device info: 
[ INFO ] CPU 
[ INFO ] Build .................................2024.4.0-16028-fe423b97163 
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration 
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files 
[ INFO ] Loading model files 
[ INFO ] Read model took 62.05 ms 
[ INFO ] Original model I/O parameters: 
[ INFO ] Model inputs: 
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs: 
[ INFO ]     logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch 
[ INFO ] Model batch size: 1 
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64] 
[ INFO ] Reshape model took 28.43 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs: 
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs: 
[ INFO ]     logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device 
[ INFO ] Compile model took 1005.52 ms 
[Step 8/11] Querying optimal runtime parameters 
[ INFO ] Model: 
[ INFO ]     NETWORK_NAME: torch_jit
[ INFO ]     OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]     NUM_STREAMS: 4
[ INFO ]     INFERENCE_NUM_THREADS: 4 
[ INFO ]     PERF_COUNT: NO 
[ INFO ]     INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]     PERFORMANCE_HINT: THROUGHPUT 
[ INFO ]     EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]     PERFORMANCE_HINT_NUM_REQUESTS: 0 
[ INFO ]     ENABLE_CPU_PINNING: True 
[ INFO ]     SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE 
[ INFO ]     MODEL_DISTRIBUTION_POLICY: set() 
[ INFO ]     ENABLE_HYPER_THREADING: True 
[ INFO ]     EXECUTION_DEVICES: ['CPU'] 
[ INFO ]     CPU_DENORMALS_OPTIMIZATION: False 
[ INFO ]     LOG_LEVEL: Level.NO
[ INFO ]     CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0 
[ INFO ]     DYNAMIC_QUANTIZATION_GROUP_SIZE: 32 
[ INFO ]     KV_CACHE_PRECISION: <Type: 'float16'> 
[ INFO ]     AFFINITY: Affinity.CORE
[Step 9/11] Creating infer requests and preparing input tensors 
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values! 
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values! 
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values! 
[ INFO ] Fill input 'input_ids' with random values 
[ INFO ] Fill input 'attention_mask' with random values 
[ INFO ] Fill input 'token_type_ids' with random values 
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration) 
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 27.14 ms 
[Step 11/11] Dumping statistics report 
[ INFO ] Execution Devices:['CPU'] 
[ INFO ] Count: 9192 iterations 
[ INFO ] Duration: 60045.59 ms 
[ INFO ] Latency: 
[ INFO ]     Median: 25.82 ms 
[ INFO ]     Average: 25.87 ms 
[ INFO ]     Min: 24.44 ms 
[ INFO ]     Max: 40.26 ms 
[ INFO ] Throughput: 153.08 FPS

Benchmark quantized sparse inference performance#

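To enable the sparse weight decompression feature, add the CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE option to the runtime configuration as shown below. The option takes a value between 0.5 and 1.0 and acts as a layer-level sparsity threshold: the feature is applied only to layers whose weight sparsity is at least this value.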

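# Dump the benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" controls the minimum sparsity rate of the
# weights for a layer to be considered for sparse optimization at runtime.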
with (quantized_sparse_dir / "perf_config_sparse.json").open("w") as outfile: 
    outfile.write( 
        """ 
        { 
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.75"} 
        } 
        """ 
    )
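
The same configuration could also be generated programmatically, which makes it easy to check the 0.5 to 1.0 range before running the benchmark. This is a minimal sketch rather than an OpenVINO API: `build_cpu_config` is a hypothetical helper, and it writes the rate as a string only because the hand-written JSON above does so.

```python
import json


def build_cpu_config(num_streams=4, num_threads=4, sparse_rate=None):
    """Build a benchmark_app -load_config dictionary for the CPU device.

    Hypothetical helper: sparse_rate=None produces the dense config; otherwise
    the value must lie in the documented range [0.5, 1.0].
    """
    cpu = {"NUM_STREAMS": num_streams, "INFERENCE_NUM_THREADS": num_threads}
    if sparse_rate is not None:
        if not 0.5 <= sparse_rate <= 1.0:
            raise ValueError("sparse_rate must be between 0.5 and 1.0")
        # Stored as a string to mirror the config file written in this notebook.
        cpu["CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE"] = str(sparse_rate)
    return {"CPU": cpu}


config = build_cpu_config(sparse_rate=0.75)
print(json.dumps(config, indent=4))
```

The resulting dictionary can be written with `json.dump` to the same perf_config_sparse.json path used above.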
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,64],attention_mask[1,64],token_type_ids[1,64]" -load_config $quantized_sparse_dir/perf_config_sparse.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments 
[ INFO ] Parsing input parameters 
[Step 2/11] Loading OpenVINO Runtime 
[ INFO ] OpenVINO: 
[ INFO ] Build .................................2024.4.0-16028-fe423b97163 
[ INFO ] 
[ INFO ] Device info: 
[ INFO ] CPU 
[ INFO ] Build .................................2024.4.0-16028-fe423b97163 
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration 
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files 
[ INFO ] Loading model files 
[ INFO ] Read model took 89.36 ms 
[ INFO ] Original model I/O parameters: 
[ INFO ] Model inputs: 
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     logits (node: logits) : f32 / [...] / [?,2]
[Step 5/11] Resizing model to match image sizes and given batch 
[ INFO ] Model batch size: 1 
[ INFO ] Reshaping model: 'input_ids': [1,64], 'attention_mask': [1,64], 'token_type_ids': [1,64] 
[ INFO ] Reshape model took 28.62 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs: 
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [1,64]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [1,64]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [1,64]
[ INFO ] Model outputs: 
[ INFO ]     logits (node: logits) : f32 / [...] / [1,2]
[Step 7/11] Loading the model to the device 
[ INFO ] Compile model took 1091.53 ms 
[Step 8/11] Querying optimal runtime parameters 
[ INFO ] Model: 
[ INFO ]     NETWORK_NAME: torch_jit
[ INFO ]     OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]     NUM_STREAMS: 4
[ INFO ]     INFERENCE_NUM_THREADS: 4 
[ INFO ]     PERF_COUNT: NO 
[ INFO ]     INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]     PERFORMANCE_HINT: THROUGHPUT 
[ INFO ]     EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]     PERFORMANCE_HINT_NUM_REQUESTS: 0 
[ INFO ]     ENABLE_CPU_PINNING: True 
[ INFO ]     SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE 
[ INFO ]     MODEL_DISTRIBUTION_POLICY: set() 
[ INFO ]     ENABLE_HYPER_THREADING: True 
[ INFO ]     EXECUTION_DEVICES: ['CPU'] 
[ INFO ]     CPU_DENORMALS_OPTIMIZATION: False 
[ INFO ]     LOG_LEVEL: Level.NO
[ INFO ]     CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 0.75 
[ INFO ]     DYNAMIC_QUANTIZATION_GROUP_SIZE: 32 
[ INFO ]     KV_CACHE_PRECISION: <Type: 'float16'> 
[ INFO ]     AFFINITY: Affinity.CORE
[Step 9/11] Creating infer requests and preparing input tensors 
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values! 
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values! 
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values! 
[ INFO ] Fill input 'input_ids' with random values 
[ INFO ] Fill input 'attention_mask' with random values 
[ INFO ] Fill input 'token_type_ids' with random values 
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration) 
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 28.28 ms 
[Step 11/11] Dumping statistics report 
[ INFO ] Execution Devices:['CPU'] 
[ INFO ] Count: 9176 iterations 
[ INFO ] Duration: 60035.45 ms 
[ INFO ] Latency: 
[ INFO ]     Median: 25.86 ms 
[ INFO ]     Average: 25.90 ms 
[ INFO ]     Min: 23.07 ms 
[ INFO ]     Max: 41.68 ms 
[ INFO ] Throughput: 152.84 FPS
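
To compare the two runs, the summary lines of each `benchmark_app` report can be parsed and the throughput ratio computed. The sketch below is one way to do that, assuming the "Median: ... ms" and "Throughput: ... FPS" line formats shown in the logs above; `parse_benchmark_summary` is a hypothetical helper, and the embedded values are copied from the two runs recorded in this notebook.

```python
import re


def parse_benchmark_summary(log_text):
    """Extract throughput (FPS) and median latency (ms) from benchmark_app output."""
    throughput = re.search(r"Throughput:\s*([\d.]+)\s*FPS", log_text)
    median = re.search(r"Median:\s*([\d.]+)\s*ms", log_text)
    if throughput is None or median is None:
        raise ValueError("benchmark summary lines not found")
    return {
        "throughput_fps": float(throughput.group(1)),
        "median_ms": float(median.group(1)),
    }


# Summary lines copied from the dense and sparse runs above.
dense_log = """
[ INFO ]     Median: 25.82 ms
[ INFO ] Throughput: 153.08 FPS
"""
sparse_log = """
[ INFO ]     Median: 25.86 ms
[ INFO ] Throughput: 152.84 FPS
"""

dense = parse_benchmark_summary(dense_log)
sparse = parse_benchmark_summary(sparse_log)
ratio = sparse["throughput_fps"] / dense["throughput_fps"]
print(f"sparse/dense throughput ratio: {ratio:.3f}")
```

In the runs recorded here the two throughputs are nearly identical (153.08 FPS dense, 152.84 FPS sparse), so results on other hardware or with other sequence lengths should be measured rather than assumed.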

When this might be helpful#

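This feature can improve inference performance for models with sparse weights in scenarios where the model is deployed to handle multiple requests in parallel asynchronously. It is especially effective with short sequence lengths, for example 32 or lower.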

For more details about asynchronous inference with OpenVINO, refer to the following documentation: