Quantize NLP models with Post-Training Quantization in NNCF

This Jupyter notebook can be launched online, opening an interactive environment in a browser window. It can also be installed locally.

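This tutorial demonstrates how to apply INT8 quantization to the natural language processing model known as BERT, using the Post-Training Quantization API of the NNCF library. It uses a Hugging Face BERT PyTorch model fine-tuned on the Microsoft Research Paraphrase Corpus (MRPC). The tutorial is designed to be extendable to custom models and datasets. It consists of the following steps:

Download and prepare the BERT model and the MRPC dataset.
Define the data loading and accuracy validation functionality.
Prepare the model for quantization.
Run the optimization pipeline.
Load and test the quantized model.
Compare the performance of the original, converted, and quantized models.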

%pip install -q "nncf>=2.5.0"
%pip install -q transformers datasets evaluate --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "openvino>=2023.1.0"

                                

                                    Note: you may need to restart the kernel to use updated packages.

                                


                                

Imports

import os
import time
from pathlib import Path
from zipfile import ZipFile
from typing import Any, Iterable

import datasets
import evaluate
import numpy as np
import nncf
from nncf.parameters import ModelType
import openvino as ov
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Fetch `notebook_utils` module
import urllib.request
urllib.request.urlretrieve(
    url='https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/main/notebooks/utils/notebook_utils.py',
    filename='notebook_utils.py'
)
from notebook_utils import download_file

                                    


                                    

                                        INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino

                                    

Settings

# Set the data and model directories, source URL and the filename of the model.
DATA_DIR = "data"
MODEL_DIR = "model"
MODEL_LINK = "https://download.pytorch.org/tutorial/MRPC.zip"
FILE_NAME = MODEL_LINK.split("/")[-1]
PRETRAINED_MODEL_DIR = os.path.join(MODEL_DIR, "MRPC")

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

                                    

Prepare the Model

Perform the following:

Download and unpack the pre-trained PyTorch BERT model for MRPC.
Convert the model to the OpenVINO Intermediate Representation (OpenVINO IR).

download_file(MODEL_LINK, directory=MODEL_DIR, show_progress=True)
with ZipFile(f"{MODEL_DIR}/{FILE_NAME}", "r") as zip_ref:
    zip_ref.extractall(MODEL_DIR)

                                    


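Convert the original PyTorch model to the OpenVINO Intermediate Representation.

Starting with OpenVINO 2023.0, the model conversion API can convert a model directly from the PyTorch format to the OpenVINO IR format. The following PyTorch model formats are supported:

torch.nn.Module
torch.jit.ScriptModule
torch.jit.ScriptFunction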

MAX_SEQ_LENGTH = 128
input_shape = ov.PartialShape([1, -1])
ir_model_xml = Path(MODEL_DIR) / "bert_mrpc.xml"
core = ov.Core()

torch_model = BertForSequenceClassification.from_pretrained(PRETRAINED_MODEL_DIR)
# Switch to inference mode (disables dropout) before tracing and benchmarking.
torch_model.eval()

# Each input is an int64 tensor with batch size 1 and a dynamic sequence length.
input_info = [
    ("input_ids", input_shape, np.int64),
    ("attention_mask", input_shape, np.int64),
    ("token_type_ids", input_shape, np.int64),
]
default_input = torch.ones(1, MAX_SEQ_LENGTH, dtype=torch.int64)
inputs = {
    "input_ids": default_input,
    "attention_mask": default_input,
    "token_type_ids": default_input,
}

# Convert the PyTorch model to OpenVINO IR FP32.
if not ir_model_xml.exists():
    model = ov.convert_model(torch_model, example_input=inputs, input=input_info)
    ov.save_model(model, str(ir_model_xml))
else:
    model = core.read_model(ir_model_xml)

                                    


                                    

Prepare the Dataset

Download the General Language Understanding Evaluation (GLUE) data for the MRPC task from Hugging Face datasets. Then tokenize the data with a pre-trained BERT tokenizer from Hugging Face.

def create_data_source():
    raw_dataset = datasets.load_dataset('glue', 'mrpc', split='validation')
    tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_DIR)

    def _preprocess_fn(examples):
        texts = (examples['sentence1'], examples['sentence2'])
        result = tokenizer(*texts, padding='max_length', max_length=MAX_SEQ_LENGTH, truncation=True)
        result['labels'] = examples['label']
        return result
    processed_dataset = raw_dataset.map(_preprocess_fn, batched=True, batch_size=1)

    return processed_dataset

data_source = create_data_source()
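
As a rough illustration of what `padding='max_length'` and `truncation=True` produce for a sentence pair, the sketch below builds the three model inputs by hand. The token IDs and the `encode_pair` helper are hypothetical and serve only as an illustration. The special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed to follow the standard BERT vocabulary, and the truncation rule is a simplification of what the real tokenizer does.

```python
from typing import Dict, List

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed standard BERT special-token IDs


def encode_pair(ids_a: List[int], ids_b: List[int], max_length: int) -> Dict[str, List[int]]:
    """Hypothetical helper: pack two token-ID lists into fixed-length BERT inputs."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP] tokens")
    ids_a, ids_b = list(ids_a), list(ids_b)
    # Reserve room for the three special tokens, then trim the longer sentence first.
    while len(ids_a) + len(ids_b) > max_length - 3:
        longer = ids_a if len(ids_a) >= len(ids_b) else ids_b
        longer.pop()

    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    # Pad every field up to max_length; padded positions are masked out.
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


encoded = encode_pair([7, 8, 9], [10, 11], max_length=10)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])
```

Every sample ends up with the same length (`MAX_SEQ_LENGTH` in this tutorial), which is why the benchmark commands later use a fixed `[1,128]` shape for each input.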

                                    

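Optimize the model using the NNCF Post-Training Quantization API

NNCF provides a suite of advanced algorithms for optimizing neural network inference in OpenVINO with minimal accuracy drop. To optimize BERT, 8-bit quantization is used in post-training mode (without a fine-tuning pipeline).

The optimization process consists of the following steps:

Create a dataset for quantization.
Run nncf.quantize to obtain an optimized model.
Serialize the OpenVINO IR model using the openvino.save_model function.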
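
To make the idea of post-training 8-bit quantization more concrete, here is a minimal sketch of symmetric per-tensor quantization: a scale is derived from calibration samples, values are rounded to signed 8-bit integers, and then mapped back to floats. This is a simplified illustration and not NNCF's actual algorithm; the function names and the sample values are hypothetical.

```python
from typing import Iterable, List


def calibrate_scale(samples: Iterable[List[float]], num_bits: int = 8) -> float:
    """Derive a symmetric scale from the largest absolute value seen during calibration."""
    max_abs = max(abs(v) for sample in samples for v in sample)
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return max_abs / q_max if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Round to the nearest integer step and clamp to the signed integer range."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(q_min, min(q_max, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Made-up calibration data; in the tutorial this role is played by the MRPC samples.
calibration_samples = [[-1.27, 0.5, 0.02], [0.9, -0.3, 1.0]]
scale = calibrate_scale(calibration_samples)
q = quantize([0.5, -1.27, 2.0], scale)
print(q)
print(dequantize(q, scale))
```

Values outside the calibrated range (2.0 in this example) are clamped, which is one reason a representative calibration dataset matters.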

INPUT_NAMES = list(inputs.keys())

def transform_fn(data_item):
    """
    Extract the model's input from the data item.
    The data item here is the data item that is returned from the data source per iteration.
    This function should be passed when the data item cannot be used as model's input.
    """
    inputs = {
        name: np.asarray([data_item[name]], dtype=np.int64) for name in INPUT_NAMES
    }
    return inputs

calibration_dataset = nncf.Dataset(data_source, transform_fn)
# Quantize the model. By specifying model_type, we specify additional transformer patterns in the model.
quantized_model = nncf.quantize(model, calibration_dataset,
                                model_type=ModelType.TRANSFORMER)

                                    


                                        INFO:nncf:36 ignored nodes were found by name in the NNCFGraph

                                    

                                        INFO:nncf:50 ignored nodes were found by name in the NNCFGraph

                                    


compressed_model_xml = Path(MODEL_DIR) / "quantized_bert_mrpc.xml"
ov.save_model(quantized_model, compressed_model_xml)

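Load and Test the OpenVINO Model

To load and test the converted model, perform the following steps:

Load the model and compile it for the selected device.
Prepare the input.
Run inference.
Get the answer from the model output.

Select the inference device

Select a device from the dropdown list to run inference with OpenVINO.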

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

                                        

                                            Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

                                        

# Compile the model for a specific device.
compiled_quantized_model = core.compile_model(model=quantized_model, device_name=device.value)
output_layer = compiled_quantized_model.outputs[0]

                                        

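The data source returns a pair of sentences (selected by sample_idx). Inference compares the two sentences and outputs whether they have the same meaning. To test other sentences, change sample_idx to another value (0 to 407).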

sample_idx = 5
sample = data_source[sample_idx]
inputs = {k: torch.unsqueeze(torch.tensor(sample[k]), 0) for k in ['input_ids', 'token_type_ids', 'attention_mask']}

result = compiled_quantized_model(inputs)[output_layer]
result = np.argmax(result)

print(f"Text 1: {sample['sentence1']}")
print(f"Text 2: {sample['sentence2']}")
print(f"The same meaning: {'yes' if result == 1 else 'no'}")

                                        

                                            Text 1: Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .
Text 2: It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
The same meaning: yes
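
The final step, turning the model output into an answer, can be sketched as follows. The logits below are made-up values rather than real model output. The label convention (1 means paraphrase, 0 means not a paraphrase) follows the `result == 1` check in the cell above, and the `softmax` and `interpret` helpers are hypothetical additions that also report a probability.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits: List[float]) -> Tuple[str, float]:
    """Return the answer and the probability of the predicted class."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return ("yes" if label == 1 else "no"), probs[label]


answer, confidence = interpret([-1.2, 2.3])  # hypothetical logits
print(f"The same meaning: {answer} (p = {confidence:.2f})")
```

Taking the argmax of the logits and the argmax of the softmax probabilities gives the same label, so the notebook can skip the softmax when only the answer is needed.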

                                        

Compare the F1 score of the FP32 and INT8 models

def validate(model: ov.Model, dataset: Iterable[Any]) -> float:
    """
    Evaluate the model on GLUE dataset.
    Returns F1 score metric.
    """
    compiled_model = core.compile_model(model, device_name=device.value)
    output_layer = compiled_model.output(0)

    metric = evaluate.load('glue', 'mrpc')
    for batch in dataset:
        inputs = [
            np.expand_dims(np.asarray(batch[key], dtype=np.int64), 0) for key in INPUT_NAMES
        ]
        outputs = compiled_model(inputs)[output_layer]
        predictions = outputs[0].argmax(axis=-1)
        metric.add_batch(predictions=[predictions], references=[batch['labels']])
    metrics = metric.compute()
    f1_score = metrics['f1']

    return f1_score


print('Checking the accuracy of the original model:')
metric = validate(model, data_source)
print(f'F1 score: {metric:.4f}')

print('Checking the accuracy of the quantized model:')
metric = validate(quantized_model, data_source)
print(f'F1 score: {metric:.4f}')

                                    

Checking the accuracy of the original model:
F1 score: 0.9019
Checking the accuracy of the quantized model:
F1 score: 0.8969
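
For reference, the F1 score is the harmonic mean of precision and recall for the positive (paraphrase) class. The sketch below computes it from scratch on a handful of made-up labels and then works out the relative drop between the two scores printed above. The `f1_score` helper and the toy labels are illustrative only; the notebook itself relies on `evaluate.load('glue', 'mrpc')`.

```python
from typing import Sequence


def f1_score(predictions: Sequence[int], references: Sequence[int], positive: int = 1) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up): 3 true positives, 1 false positive, 1 false negative.
preds = [1, 1, 1, 1, 0, 0]
refs = [1, 1, 1, 0, 1, 0]
print(f"Toy F1: {f1_score(preds, refs):.4f}")

# Relative accuracy drop between the FP32 and INT8 scores reported above.
f1_fp32, f1_int8 = 0.9019, 0.8969
relative_drop = (f1_fp32 - f1_int8) / f1_fp32
print(f"Relative F1 drop: {relative_drop:.2%}")
```

With the reported scores, the quantized model loses roughly half a percent of F1 relative to the FP32 model.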

                                    

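Compare the performance of the original, converted, and quantized models

Compare the original PyTorch model with the converted and quantized OpenVINO models (FP32 and INT8) to see the difference in performance. Performance is expressed in Sentences Per Second (SPS), analogous to Frames Per Second (FPS) for images.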

# Compile the model for a specific device.
compiled_model = core.compile_model(model=model, device_name=device.value)

num_samples = 50
sample = data_source[0]
inputs = {k: torch.unsqueeze(torch.tensor(sample[k]), 0) for k in ['input_ids', 'token_type_ids', 'attention_mask']}

with torch.no_grad():
    start = time.perf_counter()
    for _ in range(num_samples):
        # Pass the tensors by keyword so that attention_mask and token_type_ids reach the model.
        torch_model(**inputs)
    end = time.perf_counter()
    time_torch = end - start
print(
    f"PyTorch model on CPU: {time_torch / num_samples:.3f} seconds per sentence, "
    f"SPS: {num_samples / time_torch:.2f}"
)

start = time.perf_counter()
for _ in range(num_samples):
    compiled_model(inputs)
end = time.perf_counter()
time_ir = end - start
print(
    f"IR FP32 model in OpenVINO Runtime/{device.value}: {time_ir / num_samples:.3f} "
    f"seconds per sentence, SPS: {num_samples / time_ir:.2f}"
)

start = time.perf_counter()
for _ in range(num_samples):
    compiled_quantized_model(inputs)
end = time.perf_counter()
time_ir = end - start
print(
    f"OpenVINO IR INT8 model in OpenVINO Runtime/{device.value}: {time_ir / num_samples:.3f} "
    f"seconds per sentence, SPS: {num_samples / time_ir:.2f}"
)

PyTorch model on CPU: 0.073 seconds per sentence, SPS: 13.77
IR FP32 model in OpenVINO Runtime/AUTO: 0.021 seconds per sentence, SPS: 47.89
OpenVINO IR INT8 model in OpenVINO Runtime/AUTO: 0.009 seconds per sentence, SPS: 109.72
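
The throughput figures can be turned into speedup ratios with simple arithmetic. The sketch below also shows a small, hypothetical `measure_sps` helper that wraps the timing loop used above. The untimed warm-up calls are an added assumption about good benchmarking practice; the notebook's own loop does not warm up. Absolute numbers depend on the hardware, so the ratios apply only to the run reported here.

```python
import time
from typing import Callable


def measure_sps(run_once: Callable[[], object], num_samples: int = 50, warmup: int = 3) -> float:
    """Hypothetical helper: call `run_once` repeatedly and return calls per second."""
    for _ in range(warmup):  # warm-up calls are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(num_samples):
        run_once()
    elapsed = time.perf_counter() - start
    return num_samples / elapsed


# Stand-in workload so the sketch runs without a model.
sps = measure_sps(lambda: sum(range(1000)), num_samples=20)
print(f"Dummy workload: {sps:.2f} calls per second")

# Speedups derived from the SPS values printed above.
sps_torch, sps_fp32, sps_int8 = 13.77, 47.89, 109.72
print(f"OpenVINO FP32 vs. PyTorch: {sps_fp32 / sps_torch:.2f}x")
print(f"OpenVINO INT8 vs. FP32:    {sps_int8 / sps_fp32:.2f}x")
print(f"OpenVINO INT8 vs. PyTorch: {sps_int8 / sps_torch:.2f}x")
```

In a notebook, the helper could be called as `measure_sps(lambda: compiled_quantized_model(inputs))`, assuming the compiled model and inputs from the cells above.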

                                    

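Finally, measure the inference performance of the OpenVINO FP32 and INT8 models using the OpenVINO Benchmark Tool.

Note: The benchmark_app tool can only measure the performance of OpenVINO Intermediate Representation (OpenVINO IR) models. For more accurate performance results, close other applications and run benchmark_app in a terminal or command prompt. Run benchmark_app -m model.xml -d CPU to benchmark asynchronous inference on the CPU for one minute. To benchmark on a GPU, change CPU to GPU. Run benchmark_app --help to see an overview of all command-line options.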
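
When benchmark_app is launched from a script rather than a notebook cell, the command can be assembled explicitly, with the device name substituted in as a value. The sketch below only builds and prints the command. The `build_benchmark_cmd` helper is hypothetical, while `-m`, `-shape`, `-d`, `-api`, and `-t` are options listed by `benchmark_app --help`.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_benchmark_cmd(model_xml: Path, device: str, seq_len: int = 128,
                        num_inputs: int = 3, seconds: Optional[int] = None) -> List[str]:
    """Hypothetical helper: assemble a benchmark_app call for a BERT-style model."""
    # One static [1, seq_len] shape per model input.
    shape = ",".join(f"[1,{seq_len}]" for _ in range(num_inputs))
    cmd = ["benchmark_app", "-m", str(model_xml), "-shape", shape, "-d", device, "-api", "sync"]
    if seconds is not None:
        cmd += ["-t", str(seconds)]  # benchmark duration in seconds
    return cmd


cmd = build_benchmark_cmd(Path("model") / "bert_mrpc.xml", device="CPU", seconds=15)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with the bracketed shape argument.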

# Inference FP32 model (OpenVINO IR)
!benchmark_app -m $ir_model_xml -shape [1,128],[1,128],[1,128] -d {device.value} -api sync

                                    

# Inference INT8 model (OpenVINO IR)
!benchmark_app -m $compressed_model_xml -shape [1,128],[1,128],[1,128] -d {device.value} -api sync