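Post-Training Quantization of OpenAI Whisper Model with NNCF

This Jupyter notebook can be launched only after a local installation.

This tutorial demonstrates how to speed up the model by applying 8-bit post-training quantization from NNCF (Neural Network Compression Framework) and how to run inference on the quantized model with the OpenVINO™ Toolkit. The optimization process contains the following steps:

Quantize the OpenVINO model converted in the 227-whisper-convert notebook with NNCF.
Check the model result on the demo video.
Compare the model size, performance, and accuracy of the FP32 and quantized INT8 models.

Note: Run the 227-whisper-convert notebook first to generate the OpenVINO IR model that is used for quantization.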

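Table of contents

Prerequisites
Create and initialize quantization
- Prepare calibration dataset
- Quantize Whisper encoder and decoder models
Transcribe video with quantized OpenVINO model
Compare performance and accuracy of the FP32 and INT8 IRs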

Prerequisites

Install dependencies.

%pip install -q "openvino>=2023.1.0"
%pip install -q "nncf>=2.6.0"
%pip install -q datasets librosa soundfile
%pip install -q evaluate jiwer

                                    

Select the model to quantize.

from pathlib import Path
import ipywidgets as widgets

def get_model_id(model_path):
    return model_path.name.replace("whisper_", "").replace("encoder.xml", "").replace("_", "")

model_list = [get_model_id(model_path) for model_path in Path('.').glob("whisper_*encoder.xml")]
model_list = [model_name for model_name in model_list if model_name]

if not model_list:
    raise RuntimeError("Please run conversion notebook first")

model_id = widgets.Dropdown(
    options=model_list,
    value=model_list[0],
    description='Model:',
    disabled=False,
)

model_id

                                    

                                        Dropdown(description='Model:', options=('large-v2', 'large-v3'), value='large-v2')
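To make the file-name convention concrete, the sketch below applies the same get_model_id logic to a few file names. The names are hypothetical examples, not files shipped with the notebook. It also suggests why an already quantized *_int8.xml file is not offered in the dropdown: it does not match the whisper_*encoder.xml pattern.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def get_model_id(model_path: Path) -> str:
    # Same string manipulation as in the cell above.
    return model_path.name.replace("whisper_", "").replace("encoder.xml", "").replace("_", "")


# Hypothetical file names, used only for illustration.
candidates = [
    "whisper_large-v2_encoder.xml",
    "whisper_base_encoder.xml",
    "whisper_large-v2_encoder_int8.xml",
]

# fnmatchcase approximates the pattern matching that Path.glob performs on file names.
matched = [name for name in candidates if fnmatchcase(name, "whisper_*encoder.xml")]
model_ids = [get_model_id(Path(name)) for name in matched]
print(model_ids)
```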

                                    

Select the device for running inference with OpenVINO from the dropdown list.

import ipywidgets as widgets

from openvino import Core
core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

                                    

                                        Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

                                    

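Select the task for the model:

Transcribe - generate an audio transcription in the source language (detected automatically).
Translate - generate an audio transcription with translation to English.

task = widgets.Select(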
    options=["transcribe", "translate"],
    value="translate",
    description="Select task:",
    disabled=False
)
task

                                    

                                        Select(description='Select task:', index=1, options=('transcribe', 'translate'), value='translate')

                                    

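Create and initialize quantization

NNCF enables post-training quantization by adding quantization layers to the model graph and then using a subset of the training dataset to initialize the parameters of these additional layers. The framework is designed so that changes to the original training code are minimal. Quantization is the simplest scenario and requires only a few modifications.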
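As a rough intuition for what a quantization layer initialized from calibration data does, here is a minimal pure-Python sketch of 8-bit asymmetric quantization. It is an illustration under simplifying assumptions (a single per-tensor min/max range), not NNCF's implementation, and every function name in it is hypothetical.

```python
def calibrate_range(samples):
    """Return the (min, max) range over all calibration samples, always including zero."""
    values = [v for sample in samples for v in sample]
    return min(min(values), 0.0), max(max(values), 0.0)


def quantization_params(lo, hi, num_bits=8):
    """Derive the scale and zero point of an unsigned asymmetric quantizer."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an integer level, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float value."""
    return (q - zero_point) * scale


# Toy calibration data standing in for recorded activations.
calibration_samples = [[-1.0, 0.2, 0.5], [0.1, 3.0, -0.4]]
lo, hi = calibrate_range(calibration_samples)
scale, zero_point = quantization_params(lo, hi)

for x in (-1.0, 0.5, 3.0):
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```

Values inside the calibrated range are reproduced with an error of at most half a quantization step, which is why the calibration data should be representative of real inputs.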

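The optimization process contains the following steps:

Create a calibration dataset for quantization.
Run nncf.quantize to obtain the quantized models.
Serialize the INT8 models using the openvino.runtime.serialize function.

Set the paths to the models converted in the 227-whisper-convert notebook and the paths where the quantized models will be saved.

from pathlib import Path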

WHISPER_ENCODER_OV = Path(f"whisper_{model_id.value}_encoder.xml")
WHISPER_DECODER_OV = Path(f"whisper_{model_id.value}_decoder.xml")

WHISPER_ENCODER_OV_INT8 = Path(f"whisper_{model_id.value}_encoder_int8.xml")
WHISPER_DECODER_OV_INT8 = Path(f"whisper_{model_id.value}_decoder_int8.xml")

Load the FP32 model IR.

import whisper
from utils import patch_whisper_for_ov_inference, OpenVINOAudioEncoder, OpenVINOTextDecoder

model_fp32 = whisper.load_model(model_id.value, "cpu").eval()
patch_whisper_for_ov_inference(model_fp32)

model_fp32.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV, device=device.value)
model_fp32.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV, device=device.value)

                                    

Prepare calibration dataset

Whisper consists of an encoder model and a decoder model. Calibration data must be collected for both of them.

Below, we override the forward methods of the encoder and decoder to collect calibration samples.

from contextlib import contextmanager
from functools import partial
import openvino as ov
from typing import Optional
import torch

COLLECT_CALIBRATION_DATA = False
encoder_calibration_data = []
decoder_calibration_data = []

@contextmanager
def calibration_data_collection():
    global COLLECT_CALIBRATION_DATA
    try:
        COLLECT_CALIBRATION_DATA = True
        yield
    finally:
        COLLECT_CALIBRATION_DATA = False


def encoder_forward(self, mel: torch.Tensor):
    if COLLECT_CALIBRATION_DATA:
        encoder_calibration_data.append(mel)
    return torch.from_numpy(self.compiled_model(mel)[self.output_blob])

def decoder_forward(self, x: torch.Tensor, xa: torch.Tensor, kv_cache: Optional[dict] = None):
    feed_dict = {'x': ov.Tensor(x.numpy()), 'xa': ov.Tensor(xa.numpy())}
    feed_dict = (self.preprocess_kv_cache_inputs(feed_dict, kv_cache))
    if COLLECT_CALIBRATION_DATA:
        decoder_calibration_data.append(feed_dict)
    res = self.compiled_model(feed_dict)
    return self.postprocess_outputs(res)

model_fp32.encoder.forward = partial(encoder_forward, model_fp32.encoder)
model_fp32.decoder.forward = partial(decoder_forward, model_fp32.decoder)
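The recording pattern used above can be tried in isolation. The following sketch replaces the Whisper wrappers with a hypothetical ToyModel and keeps the collection flag in a small dictionary instead of a module-level global; apart from that, the idea is the same: inputs are stored only while the context manager is active.

```python
from contextlib import contextmanager
from functools import partial

state = {"collect": False}
recorded_inputs = []


@contextmanager
def collecting():
    """Enable input recording only inside the `with` block."""
    try:
        state["collect"] = True
        yield
    finally:
        # Always switch recording off again, even if inference raised.
        state["collect"] = False


class ToyModel:
    """Hypothetical stand-in for an encoder or decoder wrapper."""

    def forward(self, x):
        return [2 * v for v in x]


def recording_forward(self, original_forward, x):
    # Store a copy of the input when collection is enabled, then run as usual.
    if state["collect"]:
        recorded_inputs.append(list(x))
    return original_forward(x)


model = ToyModel()
model.forward = partial(recording_forward, model, model.forward)

model.forward([1, 2])          # outside the block: not recorded
with collecting():
    model.forward([3, 4])      # inside the block: recorded
model.forward([5, 6])          # outside again: not recorded
print(recorded_inputs)
```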

                                        

We use a portion of the validation split of the librispeech_asr dataset from Hugging Face as calibration data.

from datasets import load_dataset
from tqdm.notebook import tqdm

CALIBRATION_DATASET_SIZE = 30

calibration_dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True).take(CALIBRATION_DATASET_SIZE)

with calibration_data_collection():
    for data_item in tqdm(calibration_dataset, desc="Collecting calibration data", total=CALIBRATION_DATASET_SIZE):
        model_fp32.transcribe(data_item["audio"]["array"].astype("float32"), task=task.value)


Quantize Whisper encoder and decoder models

Quantize both the encoder and decoder models with the nncf.quantize() API and save the quantized IRs.

import nncf
from openvino.runtime import serialize

print("Quantizing encoder...")
quantized_encoder = nncf.quantize(
    model=model_fp32.encoder.model,
    calibration_dataset=nncf.Dataset(encoder_calibration_data),
    subset_size=len(encoder_calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        smooth_quant_alpha=0.5      # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
    )
)
serialize(quantized_encoder, WHISPER_ENCODER_OV_INT8)
print(f"Saved quantized encoder at ./{WHISPER_ENCODER_OV_INT8}")

print("Quantizing decoder...")
quantized_decoder = nncf.quantize(
    model=model_fp32.decoder.model,
    calibration_dataset=nncf.Dataset(decoder_calibration_data),
    subset_size=len(decoder_calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        smooth_quant_alpha=0.95     # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
    )
)
serialize(quantized_decoder, WHISPER_DECODER_OV_INT8)
print(f"Saved quantized decoder at ./{WHISPER_DECODER_OV_INT8}")

                                        

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Quantizing encoder...
Statistics collection: 100%|██████████| 60/60 [01:42<00:00,  1.72s/it]
Applying Smooth Quant: 100%|██████████| 128/128 [00:13<00:00,  9.71it/s]
INFO:nncf:96 ignored nodes was found by name in the NNCFGraph
Statistics collection: 100%|██████████| 60/60 [03:17<00:00,  3.29s/it]
Applying Fast Bias correction: 100%|██████████| 162/162 [03:09<00:00,  1.17s/it]
Saved quantized encoder at ./whisper_large-v2_encoder_int8.xml
Quantizing decoder...
Statistics collection: 100%|██████████| 669/669 [03:20<00:00,  3.33it/s]
Applying Smooth Quant: 100%|██████████| 194/194 [00:23<00:00,  8.41it/s]
INFO:nncf:192 ignored nodes was found by name in the NNCFGraph
Statistics collection: 100%|██████████| 669/669 [07:22<00:00,  1.51it/s]
Applying Fast Bias correction: 100%|██████████| 256/256 [04:01<00:00,  1.06it/s]
Saved quantized decoder at ./whisper_large-v2_decoder_int8.xml
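The smooth_quant_alpha values in the code above control the SmoothQuant step. As published, SmoothQuant divides each activation channel by a scale and multiplies the matching weight row by the same scale, so the layer output is unchanged while activation outliers shrink; alpha decides how much of the range is moved into the weights. The sketch below illustrates that idea on toy matrices in pure Python. It is not NNCF's implementation, and all names and numbers are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


def smoothing_scales(activations, weights, alpha):
    """Per-input-channel scale: max|X_j| ** alpha / max|W_j| ** (1 - alpha)."""
    act_absmax = [max(abs(v) for v in col) for col in zip(*activations)]
    weight_absmax = [max(abs(v) for v in row) for row in weights]
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Toy data: input channel 0 of the activations carries an outlier.
X = [[100.0, 1.0], [-50.0, 0.5]]   # rows: samples, columns: input channels
W = [[0.5, -0.25], [2.0, 1.0]]     # rows: input channels, columns: outputs

s = smoothing_scales(X, W, alpha=0.5)
X_smooth = [[v / s_j for v, s_j in zip(row, s)] for row in X]
W_smooth = [[v * s_j for v in row] for row, s_j in zip(W, s)]

# Both products agree up to floating-point rounding.
print(matmul(X, W))
print(matmul(X_smooth, W_smooth))
```

A larger alpha shifts more of the activation range into the weights, which makes the activations easier and the weights harder to quantize.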

                                        

Transcribe video with quantized OpenVINO model

Load the INT8 models saved above into a new instance of the Whisper model.

model_int8 = whisper.load_model(model_id.value, device="cpu").eval()
patch_whisper_for_ov_inference(model_int8)

model_int8.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV_INT8, device=device.value)
model_int8.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV_INT8, device=device.value)

As in the 227-whisper-convert notebook, select a video to transcribe.

VIDEO_LINK = "https://youtu.be/kgL5LBM-hFI"
link = widgets.Text(
    value=VIDEO_LINK,
    placeholder="Type link for video",
    description="Video:",
    disabled=False
)
link

                                    

                                        Text(value='https://youtu.be/kgL5LBM-hFI', description='Video:', placeholder='Type link for video')

                                    

from pytube import YouTube

print(f"Downloading video {link.value} started")

output_file = Path("downloaded_video.mp4")
yt = YouTube(link.value)
yt.streams.get_highest_resolution().download(filename=output_file)
print(f"Video saved to {output_file}")

                                    

                                        Downloading video https://youtu.be/kgL5LBM-hFI started
Video saved to downloaded_video.mp4

from utils import get_audio

audio, duration = get_audio(output_file)

Run the transcription with the quantized model.

transcription = model_int8.transcribe(audio, task=task.value)

                                    

from utils import prepare_srt

srt_lines = prepare_srt(transcription, duration)
# save transcription
with output_file.with_suffix(".srt").open("w") as f:
    f.writelines(srt_lines)

                                    

Now let us check the results.

widgets.Video.from_file(output_file, loop=False, width=800, height=800)

                                    


print("".join(srt_lines))

                                    

                                        1
00:00:00,000 --> 00:00:05,000
 What's that?

2
00:00:05,000 --> 00:00:07,000
 Oh, wow.

3
00:00:09,000 --> 00:00:11,000
 Hello, humans.

4
00:00:13,000 --> 00:00:15,000
 Focus on me.

5
00:00:15,000 --> 00:00:17,000
 Focus on the guard.

6
00:00:17,000 --> 00:00:20,000
 Don't tell anyone what you see in here.

7
00:00:22,000 --> 00:00:24,000
 Have you seen what's in there?

8
00:00:24,000 --> 00:00:25,000
 They have...

9
00:00:25,000 --> 00:00:27,000
 Intel. This is where it all changes.
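The output above follows the SubRip (SRT) layout: an entry number, a start --> end time range written as HH:MM:SS,mmm, the subtitle text, and a blank line. The sketch below shows one way such entries could be formatted. It is an assumption-based illustration with hypothetical helper names, not the code of prepare_srt from utils.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered subtitle entry followed by a blank line."""
    return f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n\n"


print(srt_entry(1, 0.0, 5.0, " What's that?"), end="")
```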

                                    

The result is almost the same.

Compare performance and accuracy of the FP32 and INT8 IRs

Compare the model file sizes.

def calculate_compression_rate(model_path_ov, model_path_ov_int8):
    model_size_fp32 = model_path_ov.with_suffix(".bin").stat().st_size / 1024
    model_size_int8 = model_path_ov_int8.with_suffix(".bin").stat().st_size / 1024
    print(f"Model: {model_path_ov.stem}")
    print(f"    * FP32 IR model size: {model_size_fp32:.2f} KB")
    print(f"    * INT8 IR model size: {model_size_int8:.2f} KB")
    print(f"    * Model compression rate: {model_size_fp32 / model_size_int8:.3f}")

calculate_compression_rate(WHISPER_ENCODER_OV, WHISPER_ENCODER_OV_INT8)
calculate_compression_rate(WHISPER_DECODER_OV, WHISPER_DECODER_OV_INT8)

                                    

                                        Model: whisper_large-v2_encoder
    * FP32 IR model size: 1244080.07 KB
    * INT8 IR model size: 626971.58 KB
    * Model compression rate: 1.984
Model: whisper_large-v2_decoder
    * FP32 IR model size: 1900607.09 KB
    * INT8 IR model size: 955679.81 KB
    * Model compression rate: 1.989
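The reported rates can be recomputed directly from the sizes printed above. The short sketch below does that and also expresses the saved disk space in GiB; the numbers are copied from the output, and the combined ratio is an extra figure derived from them.

```python
# Sizes in KB (bytes / 1024), copied from the output above.
sizes_kb = {
    "encoder": (1244080.07, 626971.58),
    "decoder": (1900607.09, 955679.81),
}

ratios = {}
for name, (fp32_kb, int8_kb) in sizes_kb.items():
    ratios[name] = fp32_kb / int8_kb
    saved_gib = (fp32_kb - int8_kb) / 1024 ** 2   # KB -> GiB
    print(f"{name}: ratio {ratios[name]:.3f}, saved {saved_gib:.2f} GiB")

# Ratio for encoder and decoder files taken together.
total_fp32 = sum(fp32 for fp32, _ in sizes_kb.values())
total_int8 = sum(int8 for _, int8 in sizes_kb.values())
combined_ratio = total_fp32 / total_int8
print(f"combined ratio: {combined_ratio:.3f}")
```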

                                    

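To measure the inference performance of the FP32 and INT8 encoder/decoder models, we use the median inference time on the calibration dataset. This gives an approximate estimate of the speedup of the quantized models.

Note: For the most accurate performance estimation, it is recommended to run benchmark_app with static shapes in a terminal/command prompt after closing other applications.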
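A small illustration of why the median is a reasonable choice for per-call timings: a single stalled call barely moves the median but noticeably inflates the mean. The latencies below are made-up values, not measurements from this notebook.

```python
import statistics

# Hypothetical per-call latencies in seconds; the last call hit a one-off stall.
latencies = [0.10, 0.11, 0.10, 0.12, 0.95]

median_latency = statistics.median(latencies)
mean_latency = statistics.mean(latencies)

print(f"median: {median_latency:.3f} s")
print(f"mean:   {mean_latency:.3f} s")
```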

import time
import numpy as np

def calculate_call_inference_time(model, dataset):
    inference_time = []
    for data_item in tqdm(dataset[:100], desc="Measuring performance"):
        start = time.perf_counter()
        model(data_item)
        end = time.perf_counter()
        delta = end - start
        inference_time.append(delta)
    return np.median(inference_time)


encoder_time_fp32 = calculate_call_inference_time(model_fp32.encoder.compiled_model, encoder_calibration_data)
encoder_time_int8 = calculate_call_inference_time(model_int8.encoder.compiled_model, encoder_calibration_data)
print(f"Encoder performance speedup: {encoder_time_fp32 / encoder_time_int8:.3f}")

decoder_time_fp32 = calculate_call_inference_time(model_fp32.decoder.compiled_model, decoder_calibration_data)
decoder_time_int8 = calculate_call_inference_time(model_int8.decoder.compiled_model, decoder_calibration_data)
print(f"Decoder performance speedup: {decoder_time_fp32 / decoder_time_int8:.3f}")

                                    


                                        Encoder performance speedup: 1.763

                                    


                                        Decoder performance speedup: 2.022

                                    

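Because a single Whisper transcribe() call triggers multiple encoder and decoder inference calls, we measure the performance of the whole transcription separately. The number of these calls is dynamic and depends on the model accuracy. Because the transcription time is not constant, we use the mean time here instead of the median.

We also compare the accuracy of the FP32 and INT8 models on a subset of the librispeech_asr test dataset. We rely on the Word Error Rate (WER) metric and compute accuracy as (1 - WER).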
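For readers unfamiliar with the metric, WER counts the word-level substitutions, insertions, and deletions needed to turn a prediction into its reference and divides them by the number of reference words. The sketch below is a minimal pure-Python version for illustration; the notebook itself uses the wer metric loaded through evaluate, and the sentences here are made up.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,          # deletion: reference word is missing
                current[j - 1] + 1,       # insertion: extra hypothesis word
                previous[j - 1] + cost,   # substitution or exact match
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words over all pairs."""
    errors = sum(edit_distance(r.split(), p.split()) for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


# Made-up, already normalized sentences.
references = ["hello humans focus on me", "do not tell anyone"]
predictions = ["hello human focus on", "do not tell anyone"]

wer_value = word_error_rate(references, predictions)
print(f"WER: {wer_value:.3f}, word accuracy: {(1 - wer_value) * 100:.2f}%")
```

Here one substitution and one deletion over nine reference words give a WER of 2/9.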

from evaluate import load
from transformers import WhisperProcessor

wer = load("wer")

TEST_DATASET_SIZE = 100
test_dataset = load_dataset("librispeech_asr", "clean", split="test", streaming=True).take(TEST_DATASET_SIZE)

def calculate_transcription_time_and_accuracy(model, dataset):
    processor = WhisperProcessor.from_pretrained("openai/whisper-large")

    ground_truths = []
    predictions = []
    inference_time = []
    for data_item in tqdm(dataset, desc="Measuring performance and accuracy", total=TEST_DATASET_SIZE):
        audio = data_item["audio"]["array"].astype("float32")

        start_time = time.perf_counter()
        transcription = model.transcribe(audio, task=task.value)
        end_time = time.perf_counter()
        delta_time = end_time - start_time

        reference = processor.tokenizer._normalize(data_item["text"])
        prediction = processor.tokenizer._normalize(transcription["text"])
        ground_truths.append(reference)
        predictions.append(prediction)
        inference_time.append(delta_time)

    word_accuracy = (1 - wer.compute(references=ground_truths, predictions=predictions)) * 100
    mean_inference_time = np.mean(inference_time)
    return mean_inference_time, word_accuracy

transcription_time_fp32, accuracy_fp32 = calculate_transcription_time_and_accuracy(model_fp32, test_dataset)
transcription_time_int8, accuracy_int8 = calculate_transcription_time_and_accuracy(model_int8, test_dataset)
print(f"Whisper transcription performance speedup: {transcription_time_fp32 / transcription_time_int8:.3f}")
print(f"Whisper transcription word accuracy. FP32: {accuracy_fp32:.2f}%. INT8: {accuracy_int8:.2f}%. Accuracy drop :{accuracy_fp32 - accuracy_int8:.2f}%.")

                                    

                                        Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

                                    


 Whisper transcription performance speedup: 1.799
 Whisper transcription word accuracy. FP32: 98.41%. INT8: 97.51%. Accuracy drop :0.90%.


Note: The accuracy drop can generally be reduced by increasing the calibration dataset size.