Hugging Face Model Hub with OpenVINO™#

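The Hugging Face (HF) Model Hub is a central repository for pre-trained deep learning models. It lets you explore and access thousands of models for a wide range of tasks, including text classification, question answering, and image classification. Hugging Face provides Python packages that serve as APIs and tools for easily downloading and fine-tuning state-of-the-art pre-trained models, namely the transformers and diffusers packages.

Throughout this notebook you will learn:
1. How to load an HF pipeline using the transformers package and then convert it to OpenVINO.
2. How to load the same pipeline using the Optimum Intel package.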

Table of contents:

Converting a Model from the HF Transformers Package
Converting a Model Using the Optimum Intel Package

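Converting a Model from the HF Transformers Package#

The Hugging Face transformers package provides an API for initializing a model and loading a set of pre-trained weights using the model's text handle. Finding the desired model name is straightforward on the Models page of the HF website: you can choose a model that solves a particular machine learning problem and sort the models by popularity or novelty.

Installing Requirements#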

%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu "transformers>=4.33.0" "torch>=2.1.0" 
%pip install -q ipywidgets 
%pip install -q "openvino>=2023.1.0"

Note: you may need to restart the kernel to use updated packages. 

Imports#

from pathlib import Path 

import numpy as np 
import torch 

from transformers import AutoModelForSequenceClassification 
from transformers import AutoTokenizer

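Initializing a Model Using the HF Transformers Package#

This example uses a roberta text classification model. It is a transformer-based encoder model pre-trained in a special way. Refer to the model card to learn more.

Following the instructions on the model page, use AutoModelForSequenceClassification to initialize the model and run inference with it. For more on HF pipelines and model initialization, refer to the HF tutorials.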

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest" 

tokenizer = AutoTokenizer.from_pretrained(MODEL, return_dict=True) 

# The torchscript=True flag ensures that the model outputs are tuples
# rather than ModelOutput objects, which cause JIT errors
model = AutoModelForSequenceClassification.from_pretrained(MODEL, torchscript=True)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight'] 
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). 
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Original Model Inference#

Let's classify a simple prompt below.

text = "HF models run perfectly with OpenVINO!" 

encoded_input = tokenizer(text, return_tensors="pt") 
output = model(**encoded_input) 
scores = output[0][0] 
scores = torch.softmax(scores, dim=0).numpy(force=True) 

def print_prediction(scores): 
    for i, descending_index in enumerate(scores.argsort()[::-1]): 
        label = model.config.id2label[descending_index] 
        score = np.round(float(scores[descending_index]), 4) 
        print(f"{i+1}) {label} {score}") 

print_prediction(scores)

1) positive 0.9485 
2) neutral 0.0484 
3) negative 0.0031
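
The scoring step above can also be sketched without PyTorch. This is a minimal illustration only: the logits and the label mapping below are made-up values that mirror the three classes printed above, not values read from the model.

```python
import math

# Hypothetical label map and logits, chosen only for illustration.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(scores, id2label):
    """Return (label, score) pairs sorted from most to least likely."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], round(scores[i], 4)) for i in order]


example_logits = [-2.0, 0.5, 3.0]
ranking = rank_labels(softmax(example_logits), ID2LABEL)
for position, (label, score) in enumerate(ranking, start=1):
    print(f"{position}) {label} {score}")
```

In the notebook itself, torch.softmax and model.config.id2label play the roles of the two helpers.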

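Converting the Model to OpenVINO IR Format#

Use the OpenVINO model conversion API to convert the model (implemented in PyTorch) to the OpenVINO Intermediate Representation (IR).

Note how the real encoded_input is reused and passed to the ov.convert_model function. It is used to trace the model.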

import openvino as ov 

save_model_path = Path("./models/model.xml") 

if not save_model_path.exists(): 
    ov_model = ov.convert_model(model, example_input=dict(encoded_input)) 
    ov.save_model(ov_model, save_model_path)

/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py:4565: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
  warnings.warn(

Converted Model Inference#

First, select the device to run model inference on.

import ipywidgets as widgets 

core = ov.Core() 
device = widgets.Dropdown( 
    options=core.available_devices + ["AUTO"], 
    value="AUTO", 
    description="Device:", 
    disabled=False, 
) 

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
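
Outside a notebook, where the dropdown widget is not available, the same choice can be made in plain Python. The sketch below is an assumption-laden illustration: pick_device is a hypothetical helper, and the device list is hard-coded, whereas in the notebook it would come from core.available_devices.

```python
def pick_device(available_devices, preferred="AUTO"):
    """Return `preferred` if it is selectable, otherwise fall back to "AUTO".

    "AUTO" is always accepted, mirroring the dropdown options above.
    """
    options = list(available_devices) + ["AUTO"]
    return preferred if preferred in options else "AUTO"


# Hypothetical device list; in the notebook use core.available_devices.
detected_devices = ["CPU"]
chosen_device = pick_device(detected_devices, preferred="GPU")
print(chosen_device)
```

The returned string can then be passed wherever the notebook uses device.value.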

An OpenVINO model IR must be compiled for a specific device before model inference.

compiled_model = core.compile_model(save_model_path, device.value) 

# The compiled model is called with the same parameters as the original model
scores_ov = compiled_model(encoded_input.data)[0] 

scores_ov = torch.softmax(torch.tensor(scores_ov[0]), dim=0).detach().numpy() 

print_prediction(scores_ov)

1) positive 0.9483 
2) neutral 0.0485 
3) negative 0.0031

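Note that the predictions of the converted model closely match those of the original model: the ranking is identical and the printed scores differ by at most about 0.0002.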
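
The agreement between the two runs can be checked numerically. The sketch below copies the scores from the two printed outputs above; the tolerance is an assumed value for illustration, not one taken from the notebook.

```python
# Scores copied from the two printed outputs above (PyTorch vs. OpenVINO).
pytorch_scores = {"positive": 0.9485, "neutral": 0.0484, "negative": 0.0031}
openvino_scores = {"positive": 0.9483, "neutral": 0.0485, "negative": 0.0031}


def max_abs_difference(a, b):
    """Largest absolute per-label gap between two score dictionaries."""
    return max(abs(a[label] - b[label]) for label in a)


def ranked_labels(scores):
    """Labels ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


TOLERANCE = 1e-3  # assumed tolerance, chosen for illustration
difference = max_abs_difference(pytorch_scores, openvino_scores)
print(f"max abs difference: {difference:.4f}")
print("within tolerance:", difference <= TOLERANCE)
print("same ranking:", ranked_labels(pytorch_scores) == ranked_labels(openvino_scores))
```

Small differences like these are typical when the same model runs on a different runtime, since the order of floating-point operations may change.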

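This is a rather simple example, because the pipeline contains only one encoder model. State-of-the-art pipelines often consist of several models. Feel free to explore the other OpenVINO tutorials.

The workflow for the diffusers package is exactly the same. The first example in the list above relies on diffusers.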

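Converting a Model Using the Optimum Intel Package#

Optimum Intel is the interface between the transformers and diffusers libraries and the various tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel® architectures.

Among other use cases, Optimum Intel provides a simple interface to optimize transformers and diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference using OpenVINO Runtime.

Install Requirements for Optimum#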

%pip install -q "git+https://github.com/huggingface/optimum-intel.git" onnx

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible 
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Note: you may need to restart the kernel to use updated packages.

Import Optimum#

The Optimum Intel documentation states: You can now easily perform inference with OpenVINO Runtime on a variety of Intel® processors (see the full list of supported devices). To do so, just replace the AutoModelForXxx class with the corresponding OVModelForXxx class.

For more information, refer to the Optimum Intel documentation.

from optimum.intel.openvino import OVModelForSequenceClassification

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders.
To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-13 00:35:27.853673: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-13 00:35:28.470157: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(

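Initialize and Convert the Model Automatically Using the OVModel Class#

To load a transformers model and convert it to the OpenVINO format on the fly, set export=True when loading the model. The model can be saved in OpenVINO format with the save_pretrained method, passing the directory in which to store it as the argument. On later runs you can skip the conversion step and load the saved model from disk with the from_pretrained method, without specifying export.

The device parameter is also passed here, to compile the model for a specific device. If it is not provided, the default device is used. The device can be changed at runtime with model.to(device); note that compiling the model for a newly selected device may take some time.

In some cases it is useful to separate model initialization from compilation. For example, if you want to reshape the model with the reshape method, you can postpone compilation by passing compile=False to the from_pretrained method. Compilation can then be triggered manually with the compile method, or it happens automatically on the first inference run.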

model = OVModelForSequenceClassification.from_pretrained(MODEL, export=True, device=device.value) 

# save_pretrained() stores the converted model so that later loads can skip the conversion
model.save_pretrained("./models/optimum_model")

Framework not specified. Using pt to export the model.
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
    - use_cache -> False

WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.

Compiling the model to AUTO ...
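
The deferred-compilation workflow described above (compile=False, reshape, then compile) might look like the following sketch. It is not the notebook's own code: the helper names, the default sequence length of 128, and the assumption that Optimum Intel saves the IR as openvino_model.xml are illustrative, and the function needs optimum-intel installed to be called.

```python
from pathlib import Path


def has_exported_model(model_dir):
    """Check whether `model_dir` already holds an exported OpenVINO IR.

    Assumes Optimum Intel's default file name, openvino_model.xml.
    """
    return (Path(model_dir) / "openvino_model.xml").exists()


def load_for_static_shape(model_id, model_dir, device="AUTO", batch_size=1, sequence_length=128):
    """Load the model (exporting only if needed), reshape it, then compile."""
    # Imported lazily so the helper above works without optimum-intel.
    from optimum.intel.openvino import OVModelForSequenceClassification

    if has_exported_model(model_dir):
        # Reuse the saved IR; compile=False postpones compilation.
        model = OVModelForSequenceClassification.from_pretrained(model_dir, compile=False)
    else:
        # Convert on the fly and save the result for later runs.
        model = OVModelForSequenceClassification.from_pretrained(
            model_id, export=True, compile=False
        )
        model.save_pretrained(model_dir)

    model.reshape(batch_size, sequence_length)  # fix the input shape before compiling
    model.to(device)
    model.compile()  # explicit compilation for the selected device
    return model
```

With a static shape like this, inputs would presumably need to be padded or truncated to the chosen sequence_length before inference.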

Convert the Model Using the Optimum CLI Interface#

You can also convert a model with the Optimum CLI interface (supported starting from optimum-intel version 1.12). The general command format is:

optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>

Here, task is the task to export the model for. If it is not specified, the task is inferred automatically from the model. Available tasks depend on the model, but are among: ['default', 'fill-mask', 'text-generation', 'text2text-generation', 'text-classification', 'token-classification', 'multiple-choice', 'object-detection', 'question-answering', 'image-classification', 'image-segmentation', 'masked-im', 'semantic-segmentation', 'automatic-speech-recognition', 'audio-classification', 'audio-frame-classification', 'audio-xvector', 'image-to-text', 'stable-diffusion', 'zero-shot-object-detection']. For decoder models, use xxx-with-past to export the model using past key values in the decoder.

For the mapping between tasks and model classes, refer to the Optimum TaskManager documentation.

Additionally, you can specify weight compression: --fp16 compresses the model weights to FP16 and --int8 compresses them to INT8. Note that INT8 compression requires nncf to be installed. The export log later in this notebook reports --fp16 as deprecated in favor of --weight-format.
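
When the export is scripted rather than typed, the command line can be assembled in Python. This is a minimal sketch: build_export_command is a hypothetical helper, it covers only a few of the weight formats listed in this notebook's --help output, and it uses --weight-format rather than the deprecated --fp16.

```python
import shlex

# A subset of the weight formats listed in the --help output of the
# optimum-intel version used by this notebook; other versions may differ.
WEIGHT_FORMATS = {"fp32", "fp16", "int8", "int4"}


def build_export_command(model_id, output_dir, task=None, weight_format=None):
    """Assemble an `optimum-cli export openvino` argument list."""
    command = ["optimum-cli", "export", "openvino", "--model", model_id]
    if task is not None:
        command += ["--task", task]
    if weight_format is not None:
        if weight_format not in WEIGHT_FORMATS:
            raise ValueError(f"unsupported weight format: {weight_format}")
        command += ["--weight-format", weight_format]
    command.append(output_dir)  # positional output directory comes last
    return command


command = build_export_command(
    "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "models/optimum_model/fp16",
    task="text-classification",
    weight_format="fp16",
)
print(shlex.join(command))
# To actually run the export: subprocess.run(command, check=True)
```

Building the arguments as a list avoids shell-quoting problems with model IDs or paths that contain spaces.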

The full list of supported arguments is available with --help:

!optimum-cli export openvino --help

2024-07-13 00:35:41.047556: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
usage: optimum-cli export openvino [-h] -m MODEL [--task TASK]
[--framework {pt,tf}] [--trust-remote-code]
[--weight-format {fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}]
[--library {transformers,diffusers,timm,sentence_transformers}]
[--cache_dir CACHE_DIR]
[--pad-token-id PAD_TOKEN_ID]
[--ratio RATIO] [--sym]
[--group-size GROUP_SIZE]
[--dataset DATASET] [--all-layers] [--awq]
[--scale-estimation]
[--sensitivity-metric SENSITIVITY_METRIC]
[--num-samples NUM_SAMPLES]
[--disable-stateful]
[--disable-convert-tokenizer] [--fp16]
[--int8] [--convert-tokenizer]
output

optional arguments:
-h, --help show this help message and exit

Required arguments:
-m MODEL, --model MODEL
Model ID on huggingface.co or path on disk to load model from.
output Path indicating the directory where to store the generated OV model.
Optional arguments:
--task TASK The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['text-generation', 'text-to-audio', 'conversational', 'fill-mask', 'audio-classification', 'token-classification', 'zero-shot-object-detection', 'text-classification', 'stable-diffusion-xl', 'question-answering', 'feature-extraction', 'text2text-generation', 'sentence-similarity', 'image-segmentation', 'automatic-speech-recognition', 'depth-estimation', 'image-to-image', 'image-classification', 'stable-diffusion', 'audio-frame-classification', 'semantic-segmentation', 'mask-generation', 'multiple-choice', 'audio-xvector', 'image-to-text', 'object-detection', 'zero-shot-image-classification', 'masked-im']. For decoder models, use xxx-with-past to export the model using past key values in the decoder.
--framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment.
--trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code present in the model repository.
--weight-format {fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}
The weight format of the exported model.
--library {transformers,diffusers,timm,sentence_transformers}
The library used to load the model before export. If not provided, will attempt to infer the local checkpoint's library
--cache_dir CACHE_DIR The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
--pad-token-id PAD_TOKEN_ID
This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
--ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 1.0.
--sym Whether to apply symmetric quantization
--group-size GROUP_SIZE
The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
--dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
--all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8.
--awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ, please also provide a dataset argument. Note: it's possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
--scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale estimation. Please note, that applying scale estimation takes additional memory and time.
--sensitivity-metric SENSITIVITY_METRIC
The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation', 'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
--num-samples NUM_SAMPLES
The maximum number of samples to take from the dataset for quantization.
--disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs and outputs in the model.
--disable-convert-tokenizer
Do not add converted tokenizer and detokenizer OpenVINO models.
--fp16 Compress weights to fp16
--int8 Compress weights to int8
--convert-tokenizer [Deprecated] Add converted tokenizer and detokenizer with OpenVINO Tokenizers.

Command-line export of the model from the example above with FP16 weight compression:

!optimum-cli export openvino --model $MODEL --task text-classification --fp16 models/optimum_model/fp16

2024-07-13 00:35:45.994137: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
--fp16 option is deprecated and will be removed in a future version. Use --weight-format instead.
Framework not specified. Using pt to export the model.
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
    - use_cache -> False
OpenVINO Tokenizers is not available. To deploy models in production with C++ code, please follow installation instructions: openvinotoolkit/openvino_tokenizers

Tokenizer won't be converted.

After export, the model is available in the specified directory and can be loaded with the same OVModelForXXX class.

model = OVModelForSequenceClassification.from_pretrained("models/optimum_model/fp16", device=device.value)

Compiling the model to AUTO ...

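The Hugging Face Model Hub already hosts some models that have been converted and are ready to run. You can filter for these models by library name (just type OpenVINO) or follow this link.

Optimum Model Inference#

Model inference works exactly as it does for the original model.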

output = model(**encoded_input) 
scores = output[0][0] 
scores = torch.softmax(scores, dim=0).numpy(force=True) 

print_prediction(scores)

1) positive 0.9483 
2) neutral 0.0485 
3) negative 0.0031

More examples of using Optimum Intel can be found here:
1. Accelerate Inference of Sparse Transformer Models
2. Grammatical Error Correction with OpenVINO
3. Stable Diffusion v2.1 using Optimum-Intel OpenVINO
4. Image Generation with Stable Diffusion XL
5. Instruction Following using Databricks Dolly 2.0
6. Create an LLM-powered Chatbot using OpenVINO
7. Document Visual Question Answering Using Pix2Struct and OpenVINO
8. Automatic Speech Recognition Using Distil-Whisper and OpenVINO