NNCF PTQ API を使用した音声認識モデルの量子化

この Jupyter ノートブックは、ローカルへのインストール後にのみ起動できます。

GitHub

このチュートリアルでは、トレーニング後のモード (微調整パイプラインを使用しない) で NNCF (ニューラル・ネットワーク圧縮フレームワーク) 8 ビット量子化を使用し、Wav2Vec2 として知られる音声認識モデルに INT8 量子化を適用する方法を示します。このノートブックは、LibriSpeech ASR コーパスでトレーニングされ、微調整された Wav2Vec2-Base-960h PyTorch モデルを使用します。チュートリアルは、カスタムモデルとデータセットに拡張できるように設計されています。これは次の手順で構成されます。

  • Wav2Vec2 モデルと LibriSpeech データセットをダウンロードして準備します。

  • データの読み込みと精度検証の機能を定義します。

  • モデルの量子化。

  • 元の PyTorch モデル、OpenVINO FP16 および INT8 モデルの精度を比較します。

  • 元のモデルと量子化されたモデルのパフォーマンスを比較します。

目次

%pip install -q "openvino>=2023.3.0" "nncf>=2.7"
%pip install datasets "torchmetrics>=0.11.0" "torch>=2.1.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q soundfile librosa "transformers>=4.36.2" --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Requirement already satisfied: datasets in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (2.17.0)
Collecting torchmetrics>=0.11.0
                                    Using cached torchmetrics-1.3.0.post0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: torch>=2.1.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (2.1.0+cpu)
Requirement already satisfied: filelock in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (3.13.1)
Requirement already satisfied: numpy>=1.17 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (1.23.5)
Requirement already satisfied: pyarrow>=12.0.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (15.0.0)
Requirement already satisfied: pyarrow-hotfix in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.6)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.3.8)
Requirement already satisfied: pandas in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (2.0.3)
Requirement already satisfied: requests>=2.19.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (4.66.1)
Requirement already satisfied: xxhash in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (3.4.1)
Requirement already satisfied: multiprocess in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.70.16)
Requirement already satisfied: fsspec<=2023.10.0,>=2023.1.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets) (2023.10.0)
Requirement already satisfied: aiohttp in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (3.9.3)
Requirement already satisfied: huggingface-hub>=0.19.4 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.20.3)
Requirement already satisfied: packaging in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (6.0.1)
Collecting lightning-utilities>=0.8.0 (from torchmetrics>=0.11.0)
                                    Using cached lightning_utilities-0.10.1-py3-none-any.whl.metadata (4.8 kB)
Requirement already satisfied: typing-extensions in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from torchmetrics>=0.11.0) (4.9.0)
Requirement already satisfied: sympy in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from torch>=2.1.0) (1.12)
Requirement already satisfied: networkx in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from torch>=2.1.0) (3.1)
Requirement already satisfied: jinja2 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from torch>=2.1.0) (3.1.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: setuptools in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from lightning-utilities>=0.8.0->torchmetrics>=0.11.0) (69.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2.2.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2024.2.2)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from jinja2->torch>=2.1.0) (2.1.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from pandas->datasets) (2024.1)
Requirement already satisfied: tzdata>=2022.1 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from pandas->datasets) (2023.4)
Requirement already satisfied: mpmath>=0.19 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from sympy->torch>=2.1.0) (1.3.0)
Requirement already satisfied: six>=1.5 in /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Using cached torchmetrics-1.3.0.post0-py3-none-any.whl (840 kB)
Using cached lightning_utilities-0.10.1-py3-none-any.whl (24 kB)
Installing collected packages: lightning-utilities, torchmetrics
Successfully installed lightning-utilities-0.10.1 torchmetrics-1.3.0.post0
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

インポート

import numpy as np
import openvino as ov
import torch
import IPython.display as ipd

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

設定

from pathlib import Path

# Set the data and model directories, model source URL and model filename.
MODEL_DIR = Path("model")
MODEL_DIR.mkdir(exist_ok=True)

モデルの準備

次の手順を実行します:
- 事前トレーニングされた Wav2Vec2 モデルをダウンロードして解凍します。- モデル変換 API を実行して、モデルを PyTorch 表現から OpenVINO 中間表現 (OpenVINO IR) に変換します。

torch_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h", ctc_loss_reduction="mean")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BATCH_SIZE = 1
MAX_SEQ_LENGTH = 30480
ov_model = ov.convert_model(torch_model, example_input=torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float))

ir_model_path = MODEL_DIR / "wav2vec2_base.xml"
ov.save_model(ov_model, ir_model_path)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:593: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:632: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):

LibriSpeech データセットの準備

デモでモデルの評価をスピードアップするため、LibriSpeech データセットの短いダミーバージョン (patrickvonplaten/librispeech_asr_dummy) を使用します。モデルの精度は論文の報告と異なる場合があります。元の精度を再現するには、librispeech_asr データセットを使用します。

from datasets import load_dataset


dataset = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
test_sample = dataset[0]["audio"]


# define preprocessing function for converting audio to input values for model
def map_to_input(batch):
    preprocessed_signal = processor(batch["audio"]["array"], return_tensors="pt", padding="longest", sampling_rate=batch['audio']['sampling_rate'])
    input_values = preprocessed_signal.input_values
    batch['input_values'] = input_values
    return batch


# apply preprocessing function to dataset and remove audio column, to save memory as we do not need it anymore
dataset = dataset.map(map_to_input, batched=False, remove_columns=["audio"])
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/datasets/load.py:1454: FutureWarning: The repository for patrickvonplaten/librispeech_asr_dummy contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/patrickvonplaten/librispeech_asr_dummy
You can avoid this message in future by passing the argument trust_remote_code=True.
Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
  warnings.warn(

量子化の実行

NNCF は、精度の低下を最小限に抑えながら、OpenVINO でニューラル・ネットワーク推論を最適化する一連の高度なアルゴリズムを提供します。

事前トレーニングされた FP16 モデルとキャリブレーション・データセットから量子化モデルを作成します。最適化プロセスには次の手順が含まれます。

  1. 量子化用のデータセットを作成します。

  2. nncf.quantize を実行して、最適化されたモデルを取得します。nncf.quantize 関数は、モデル量子化のインターフェイスを提供します。OpenVINO モデルのインスタンスと量子化データセットが必要です。オプションで、量子化プロセスの追加パラメーター (量子化のサンプル数、プリセット、無視される範囲など) を提供できます。より正確な結果を得るには、ignored_scope パラメーターを使用して、後処理サブグラフの操作を浮動小数点精度に保つ必要があります。詳細については、量子化パラメーターの調整を参照してください。このモデルでは、精度制御による量子化の結果に基づいて、無視されるスコープが実験的に選択されました。仕組みを理解するには、次のノートブックを確認してください。

  3. ov.save_model 関数を使用して OpenVINO IR モデルをシリアル化します。

import nncf
from nncf.parameters import ModelType

def transform_fn(data_item):
    """
    Extract the model's input from the data item.
    The data item here is the data item that is returned from the data source per iteration.
    This function should be passed when the data item cannot be used as model's input.
    """
    return np.array(data_item["input_values"])


calibration_dataset = nncf.Dataset(dataset, transform_fn)

quantized_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    model_type=ModelType.TRANSFORMER,  # specify additional transformer patterns in the model
    ignored_scope=nncf.IgnoredScope(
        names=[
            "__module.wav2vec2.feature_extractor.conv_layers.1.conv/aten::_convolution/Convolution",
            "__module.wav2vec2.feature_extractor.conv_layers.2.conv/aten::_convolution/Convolution",
            "__module.wav2vec2.feature_extractor.conv_layers.3.conv/aten::_convolution/Convolution",
            "__module.wav2vec2.feature_extractor.conv_layers.0.conv/aten::_convolution/Convolution",
        ],
    ),
)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
2024-02-09 22:44:10.617295: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-02-09 22:44:10.648558: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-09 22:44:11.243901: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Output()
Output()
INFO:nncf:4 ignored nodes were found by name in the NNCFGraph
INFO:nncf:36 ignored nodes were found by name in the NNCFGraph
INFO:nncf:50 ignored nodes were found by name in the NNCFGraph
INFO:nncf:Not adding activation input quantizer for operation: 3 __module.wav2vec2.feature_extractor.conv_layers.0.conv/aten::_convolution/Convolution
INFO:nncf:Not adding activation input quantizer for operation: 11 __module.wav2vec2.feature_extractor.conv_layers.1.conv/aten::_convolution/Convolution
INFO:nncf:Not adding activation input quantizer for operation: 13 __module.wav2vec2.feature_extractor.conv_layers.2.conv/aten::_convolution/Convolution
INFO:nncf:Not adding activation input quantizer for operation: 15 __module.wav2vec2.feature_extractor.conv_layers.3.conv/aten::_convolution/Convolution
Output()
Output()
MODEL_NAME = 'quantized_wav2vec2_base'
quantized_model_path = Path(f"{MODEL_NAME}_openvino_model/{MODEL_NAME}_quantized.xml")
ov.save_model(quantized_model, quantized_model_path)

推論パイプラインを使用したモデルの使用例

初期 (FP16) モデルと量子化 (INT8) モデルはどちらも使い方はまったく同じです。

最初にデータセットから 1 つの例を取り出して、その推論手順を示します。

次に、量子化モデルを推論パイプラインにロードします。

ipd.Audio(test_sample["array"], rate=16000)
core = ov.Core()

compiled_model = core.compile_model(model=quantized_model, device_name='CPU')

input_data = np.expand_dims(test_sample["array"], axis=0)

次に、推測を行います。

predictions = compiled_model(input_data)[0]
predicted_ids = np.argmax(predictions, axis=-1)
transcription = processor.batch_decode(torch.from_numpy(predicted_ids))
print(transcription)
['A MAN SAID TO THE UNIVERSE SIR I EXIST']

データセット上のモデルの精度を検証

モデルの精度評価には、Word Error Rate メトリックを使用できます。Word Error Rate (WER) は、発話された総単語数に対するトランスクリプト内のエラーの割合です。音声テキスト変換において WER が低いことは、音声認識の精度が高いことを意味します。

WER の計算には torchmetrics ライブラリーを使用します。

from torchmetrics import WordErrorRate
from tqdm.notebook import tqdm

# inference function for pytorch
def torch_infer(model, sample):
    logits = model(torch.Tensor(sample['input_values'])).logits
    # take argmax and decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription


# inference function for openvino
def ov_infer(model, sample):
    logits = model(np.array(sample['input_values']))[0]
    predicted_ids = np.argmax(logits, axis=-1)
    transcription = processor.batch_decode(torch.from_numpy(predicted_ids))
    return transcription


def compute_wer(dataset, model, infer_fn):
    wer = WordErrorRate()
    for sample in tqdm(dataset):
        # run infer function on sample
        transcription = infer_fn(model, sample)
        # update metric on sample result
        wer.update(transcription, [sample['text']])
    # finalize metric calculation
    result = wer.compute()
    return result

ここでは、トークナイザー decode_logits を使用して、予測確率をテキストにデコードするだけです。

あるいは、transformers パッケージのビルトイン Wav2Vec2Processor トークナイザーを使用します。

ここで、元の PyTorch モデル、OpenVINO IR モデル、および量子化モデルの WER を計算します。

compiled_fp32_ov_model = core.compile_model(ov_model)

pt_result = compute_wer(dataset, torch_model, torch_infer)
ov_result = compute_wer(dataset, compiled_fp32_ov_model, ov_infer)
int8_ov_result = compute_wer(dataset, compiled_model, ov_infer)
print(f'[PyTorch]       Word Error Rate: {pt_result:.4f}')
print(f'[OpenVINO FP16] Word Error Rate: {ov_result:.4}')
print(f'[OpenVINO INT8] Word Error Rate: {int8_ov_result:.4f}')
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:62: FutureWarning: Importing WordErrorRate from torchmetrics was deprecated and will be removed in 2.0. Import WordErrorRate from torchmetrics.text instead.
  _future_warning(
0%|          | 0/73 [00:00<?, ?it/s]
0%|          | 0/73 [00:00<?, ?it/s]
0%|          | 0/73 [00:00<?, ?it/s]
[PyTorch]       Word Error Rate: 0.0530
[OpenVINO FP16] Word Error Rate: 0.05304
[OpenVINO INT8] Word Error Rate: 0.0539

元のモデルと量子化モデルのパフォーマンスを比較

最後に、Benchmark ツールを使用して、FP16INT8 モデルの推論パフォーマンスを測定します。

注: より正確なパフォーマンスを得るには、他のアプリケーションを閉じて、ターミナル/コマンドプロンプトで benchmark_app を実行することを推奨します。benchmark_app -m model.xml -d CPU を実行して、CPU で非同期推論のベンチマークを 1 分間実行します。GPU でベンチマークを行うには、CPUGPU に変更します。benchmark_app --help を実行すると、すべてのコマンドライン・オプションが表示されます。

# Inference FP16 model (OpenVINO IR)
! benchmark_app -m $ir_model_path -shape [1,30480] -d CPU -api async
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 56.59 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_values (node: input_values) : f32 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     1170 , logits (node: __module.lm_head/aten::linear/Add) : f32 / [...] / [?,?,32]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_values': [1,30480]
[ INFO ] Reshape model took 30.54 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_values (node: input_values) : f32 / [...] / [1,30480]
[ INFO ] Model outputs:
[ INFO ]     1170 , logits (node: __module.lm_head/aten::linear/Add) : f32 / [...] / [1,95,32]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 559.12 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 6
[ INFO ]   NUM_STREAMS: 6
[ INFO ]   AFFINITY: Affinity.CORE
[ INFO ]   INFERENCE_NUM_THREADS: 24
[ INFO ]   PERF_COUNT: NO
[ INFO ]   INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]   PERFORMANCE_HINT: THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   ENABLE_CPU_PINNING: True
[ INFO ]   SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]   ENABLE_HYPER_THREADING: True
[ INFO ]   EXECUTION_DEVICES: ['CPU']
[ INFO ]   CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]   CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_values'!. This input will be filled with random values!
[ INFO ] Fill input 'input_values' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 6 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 56.43 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            2778 iterations
[ INFO ] Duration:         60216.97 ms
[ INFO ] Latency:
[ INFO ]    Median:        129.97 ms
[ INFO ]    Average:       129.78 ms
[ INFO ]    Min:           110.04 ms
[ INFO ]    Max:           151.03 ms
[ INFO ] Throughput:   46.13 FPS
# Inference INT8 model (OpenVINO IR)
! benchmark_app -m $quantized_model_path -shape [1,30480] -d CPU -api async
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.3.0-13775-ceeafaf64f3-releases/2023/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 67.98 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_values (node: input_values) : f32 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     1170 , logits (node: __module.lm_head/aten::linear/Add) : f32 / [...] / [?,?,32]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_values': [1,30480]
[ INFO ] Reshape model took 37.53 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_values (node: input_values) : f32 / [...] / [1,30480]
[ INFO ] Model outputs:
[ INFO ]     1170 , logits (node: __module.lm_head/aten::linear/Add) : f32 / [...] / [1,95,32]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1060.01 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 6
[ INFO ]   NUM_STREAMS: 6
[ INFO ]   AFFINITY: Affinity.CORE
[ INFO ]   INFERENCE_NUM_THREADS: 24
[ INFO ]   PERF_COUNT: NO
[ INFO ]   INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]   PERFORMANCE_HINT: THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   ENABLE_CPU_PINNING: True
[ INFO ]   SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]   ENABLE_HYPER_THREADING: True
[ INFO ]   EXECUTION_DEVICES: ['CPU']
[ INFO ]   CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]   CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_values'!. This input will be filled with random values!
[ INFO ] Fill input 'input_values' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 6 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 62.07 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            4296 iterations
[ INFO ] Duration:         60070.01 ms
[ INFO ] Latency:
[ INFO ]    Median:        83.98 ms
[ INFO ]    Average:       83.75 ms
[ INFO ]    Min:           45.06 ms
[ INFO ]    Max:           106.14 ms
[ INFO ] Throughput:   71.52 FPS