LLM-powered chatbot with Stable-Zephyr-3b and OpenVINO#

This Jupyter notebook can be launched only after a local installation.


In the rapidly evolving world of artificial intelligence (AI), chatbots have become a powerful tool for businesses to enhance customer interactions and streamline operations. Large language models (LLMs) are artificial intelligence systems that can understand and generate human language. They use deep-learning algorithms and massive amounts of data to learn the nuances of language and produce coherent, relevant responses. While a well-designed intent-based chatbot can answer basic, one-touch questions such as order management, FAQs, and policy questions, LLM chatbots can handle more complex, multi-touch questions. With an LLM, a chatbot can provide support in a conversational, human-like manner through contextual memory. By leveraging the capabilities of language models, chatbots have become increasingly intelligent, able to understand and respond to human language with remarkable accuracy.

Stable Zephyr 3B is a 3-billion-parameter model that demonstrates outstanding results on many LLM evaluation benchmarks, outperforming many popular models while being relatively small. Inspired by HuggingFaceH4's Zephyr 7B training pipeline, the model was trained on a mix of publicly available datasets and synthetic datasets using Direct Preference Optimization (DPO), and its evaluation is based on MT Bench and the Alpaca Benchmark. For more details about the model, see the model card.

This tutorial shows how to optimize and run the model using the OpenVINO toolkit. For the convenience of the conversion step and model performance evaluation, we use the llm_bench tool, which provides a unified approach to estimating LLM performance. It is based on the pipelines provided by Optimum-Intel and allows estimating performance for PyTorch and OpenVINO models with almost the same code. We also discuss how to make the model stateful, which provides the possibility to handle the model cache state.

Table of contents:

Prerequisites#

To get started, we should install the required packages first.

from pathlib import Path 
import sys 

genai_llm_bench = Path("openvino.genai/llm_bench/python") 

if not genai_llm_bench.exists():
     !git clone https://github.com/openvinotoolkit/openvino.genai.git 

sys.path.append(str(genai_llm_bench))
%pip install -q "transformers>=4.38.2" 
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu -r ./openvino.genai/llm_bench/python/requirements.txt 
%pip install --pre -Uq openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly 
%pip install -q "gradio>=4.19"

Convert the model to OpenVINO Intermediate Representation (IR) and compress model weights to INT4 using NNCF#

llm_bench provides a script for converting LLMs to OpenVINO IR format compatible with Optimum-Intel. It also allows compressing model weights to INT8 or INT4 precision with NNCF. To enable weight compression in INT4, use the --compress_weights 4BIT_DEFAULT argument. Weight compression algorithms aim to compress the weights of a model and can be used to optimize the footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, large language models (LLMs). Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
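According to the conversion log below, 4BIT_DEFAULT maps to NNCF weight compression in symmetric INT4 mode with group size 128. As a minimal sketch of what the script does internally (not part of the original notebook; the paths are illustrative and assume the FP16 IR has already been produced), the same compression could be applied manually:

import nncf 
import openvino as ov 

core = ov.Core() 
# Read the FP16 IR produced by the conversion step 
fp16_model = core.read_model("stable-zephyr-3b-stateful/pytorch/dldt/FP16/openvino_model.xml") 

# 4BIT_DEFAULT corresponds to symmetric INT4 weight compression with group size 128 
compressed_model = nncf.compress_weights( 
    fp16_model, 
    mode=nncf.CompressWeightsMode.INT4_SYM, 
    group_size=128, 
) 
ov.save_model(compressed_model, "openvino_model_int4.xml")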

Apply stateful transformation for automatic handling of model state#

Stable Zephyr is a decoder-only transformer model that generates text token by token in an autoregressive fashion. Since the output side is autoregressive, the hidden state of an output token, once computed, remains the same for every subsequent generation step. Therefore, recomputing it every time a new token is generated seems wasteful. To optimize the generation process and use memory more efficiently, the HuggingFace transformers API provides a mechanism for caching model state externally using the use_cache=True parameter and the past_key_values argument in inputs and outputs. With the cache, the model saves the hidden states once they have been computed. The model only computes the state for the most recently generated output token at each time step, reusing the saved states for hidden tokens. This reduces the generation complexity of a transformer model from O(n^3) to O(n^2). With this option, the model gets the previous step's hidden states (cached attention keys and values) as input and additionally provides hidden states for the current step as output. It means that for all subsequent iterations, it is enough to provide only a new token obtained from the previous step together with the cached key-values in order to get the next token prediction.
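To make the caching mechanism concrete, here is a minimal sketch (not part of the original notebook) of the HuggingFace API described above; gpt2 is used only as a small stand-in causal LM:

import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer 

tok_demo = AutoTokenizer.from_pretrained("gpt2") 
model_demo = AutoModelForCausalLM.from_pretrained("gpt2") 

inputs = tok_demo("The cat is", return_tensors="pt") 
with torch.no_grad(): 
    # First step: run the full prompt, the model builds the key/value cache 
    out = model_demo(**inputs, use_cache=True) 
    past_key_values = out.past_key_values 
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True) 
    # Every following step: feed only the new token plus the cached keys/values 
    out = model_demo(input_ids=next_token, past_key_values=past_key_values, use_cache=True)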

With the increasing model size typical of modern LLMs, the number of attention blocks and the size of past key-value tensors increase accordingly. Handling cache state as model inputs and outputs in the inference cycle may become a bottleneck for memory-bound systems, especially when processing long input sequences, for example, in a chatbot scenario. OpenVINO proposes a transformation that removes the inputs and corresponding outputs with cache tensors from the model while keeping the cache-handling logic inside the model. Hiding the cache enables storing and updating cache values in a more device-friendly representation. It helps reduce memory consumption and additionally optimize model performance.

llm_bench converts models to the stateful format by default. If you want to disable this behavior, you can specify the --disable-stateful flag.

stateful_model_path = Path("stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT") 

convert_script = genai_llm_bench / "convert.py" 

if not (stateful_model_path / "openvino_model.xml").exists():
     !python $convert_script --model_id stabilityai/stable-zephyr-3b --precision FP16 --compress_weights 4BIT_DEFAULT --output stable-zephyr-3b-stateful --force_convert
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino 
2024-03-05 13:50:49.184866: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-03-05 13:50:49.186797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:50:49.223416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:50:49.223832: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-05 13:50:49.887707: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: 
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.2.0+cpu) 
    Python 3.8.18 (you have 3.8.10) 
  Please reinstall xformers (see facebookresearch/xformers) 
  Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
WARNING:nncf:NNCF provides best results with torch==2.2.1, while current torch version is 2.2.0+cpu. If you encounter issues, consider switching to torch==2.2.1 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. 
  warn("The installed version of bitsandbytes was compiled without GPU support." 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32 
[ INFO ] openvino runtime version: 2024.1.0-14645-e6dc0865128 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[ INFO ] Model conversion to FP16 will be skipped as found converted model stable-zephyr-3b-stateful/pytorch/dldt/FP16/openvino_model.xml. If it is not expected behaviour, please remove previously converted model or use --force_convert option 
[ INFO ] Compress model weights to 4BIT_DEFAULT 
[ INFO ] Compression options: 
[ INFO ] {'mode': <CompressWeightsMode.INT4_SYM: 'int4_sym'>, 'group_size': 128} 
INFO:nncf:Statistics of the bitwidth distribution: 
+--------------+---------------------------+-----------------------------------+ 
| Num bits (N) | % all parameters (layers) |    % ratio-defining parameters    |
|              |                           |             (layers)              | 
+==============+===========================+===================================+ 
|  8           | 9% (2 / 226)              |  0% (0 / 224)                     | 
+--------------+---------------------------+-----------------------------------+ 
|  4           | 91% (224 / 226)           |  100% (224 / 224)                 | 
+--------------+---------------------------+-----------------------------------+ 
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━ 100% 226/226 • 0:01:29 • 0:00:00
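One way to see the effect of the stateful transformation is to inspect the converted IR: the model exposes no past_key_values inputs, and the cache lives inside the model as ReadValue/Assign (sink) operations. A minimal check (a sketch, assuming the conversion above has completed and that the Model API exposes get_sinks for the Assign sink operations):

import openvino as ov 

core = ov.Core() 
stateful_ir = core.read_model(str(stateful_model_path / "openvino_model.xml")) 
# Only token-level inputs remain; no past_key_values tensors 
print([inp.get_any_name() for inp in stateful_ir.inputs]) 
# Stateful models contain Assign sink operations for the internal cache 
print("state sinks:", len(stateful_ir.get_sinks()))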

Select device for inference#

import ipywidgets as widgets 
import openvino as ov 

core = ov.Core() 

device = widgets.Dropdown( 
    options=core.available_devices, 
    value="CPU", 
    description="Device:", 
    disabled=False, 
) 

device
Dropdown(description='Device:', options=('CPU', 'GPU.0', 'GPU.1'), value='CPU')

Estimate model performance#

The openvino.genai/llm_bench/python/benchmark.py script allows estimating text-generation pipeline inference on a specific input prompt with a given maximum number of generated tokens.

benchmark_script = genai_llm_bench / "benchmark.py" 

!python $benchmark_script -m $stateful_model_path -ic 512 -p "Tell me story about cats" -d $device.value
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.2.0+cpu) 
    Python 3.8.18 (you have 3.8.10) 
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) 
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated.Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino 
2024-03-05 13:52:39.048911: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-03-05 13:52:39.050779: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:52:39.088178: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:52:39.088623: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-05 13:52:39.754578: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. 
  warn("The installed version of bitsandbytes was compiled without GPU support. " 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: stable-zephyr-3b-stateful 
[ INFO ] OV Config={'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'} 
[ INFO ] OPENVINO_TORCH_BACKEND_DEVICE=CPU 
[ INFO ] Model path=stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version: 2024.1.0-14645-e6dc0865128 
Compiling the model to CPU ... 
[ INFO ] From pretrained time: 3.21s 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[ INFO ] Num beams: 1, benchmarking iter nums(exclude warm-up): 0, prompt nums: 1 
[ INFO ] [warm-up] Input text: Tell me story about cats 
Setting pad_token_id to eos_token_id:0 for open-end generation. 
[ INFO ] [warm-up] Input token size: 5, Output size: 336, Infer count: 512, Tokenization Time: 2.23ms, Detokenization Time: 0.51ms, Generation Time: 23.79s, Latency: 70.80 ms/token 
[ INFO ] [warm-up] First token latency: 837.58 ms/token, other tokens latency: 68.43 ms/token, len of tokens: 336 
[ INFO ] [warm-up] First infer latency: 836.44 ms/infer, other infers latency: 67.89 ms/infer, inference count: 336 
[ INFO ] [warm-up] Result MD5:['601aa0958ff0e0f9b844a9e6d186fbd9'] 
[ INFO ] [warm-up] Generated: Tell me story about cats and dogs. 
Once upon a time, in a small village, there lived a young girl named Lily. She had two pets, a cat named Mittens and a dog named Max. Mittens was a beautiful black cat with green eyes, and Max was a big lovable golden retriever with a wagging tail. 
One sunny day, Lily decided to take her pets for a walk in the nearby forest. As they were walking, they heard a loud barking sound. Suddenly, a group of dogs appeared from the bushes, led by a big brown dog with a friendly smile. 
Lily was scared at first, but Max quickly jumped in front of her and growled at the dogs. The big brown dog introduced himself as Rocky and explained that he and his friends were just out for a walk too. Lily and Rocky became fast friends, and they often went on walks together. Max and Rocky got along well too, and they would play together in the forest. 
One day, while Lily was at school, Mittens and Max decided to explore the forest and stumbled upon a group of stray cats. The cats were hungry and scared, so Mittens and Max decided to help them by giving them some food. The cats were grateful and thanked Mittens and Max for their kindness. They even allowed Mittens to climb on their backs and enjoy the sun. 
From that day on, Mittens and Max became known as the village's cat and dog heroes. They were always there to help their furry friends in need. 
And so, Lily learned that sometimes the best friends are the ones that share the same love for pets.<|endoftext|>

Compare with the stateless model#

stateless_model_path = Path("stable-zephyr-3b-stateless/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT") 

if not (stateless_model_path / "openvino_model.xml").exists():
     !python $convert_script --model_id stabilityai/stable-zephyr-3b --precision FP16 --compress_weights 4BIT_DEFAULT --output stable-zephyr-3b-stateless --force_convert --disable-stateful
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino 
2024-03-05 13:53:12.727472: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-03-05 13:53:12.729379: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:53:12.765262: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:53:12.765680: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-05 13:53:13.414451: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.2.0+cpu) 
    Python 3.8.18 (you have 3.8.10) 
  Please reinstall xformers (see facebookresearch/xformers) 
  Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
WARNING:nncf:NNCF provides best results with torch==2.2.1, while current torch version is 2.2.0+cpu. If you encounter issues, consider switching to torch==2.2.1 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. 
  warn("The installed version of bitsandbytes was compiled without GPU support." 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32 
[ INFO ] openvino runtime version: 2024.1.0-14645-e6dc0865128 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Using the export variant default. Available variants are:
    - default: The default ONNX variant. 
Using framework PyTorch: 2.2.0+cpu 
Overriding 1 configuration item(s) 
    - use_cache -> True 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/modeling_utils.py:4193: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead 
  warnings.warn( 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal: 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/optimum/exporters/onnx/model_patcher.py:299: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if past_key_values_length > 0: 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/stablelm/modeling_stablelm.py:97: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if seq_len > self.max_seq_len_cached: 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/stablelm/modeling_stablelm.py:341: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len): 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/stablelm/modeling_stablelm.py:348: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/stablelm/modeling_stablelm.py:360: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! 
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim): 
[ INFO ] Compress model weights to 4BIT_DEFAULT 
[ INFO ] Compression options: 
[ INFO ] {'mode': <CompressWeightsMode.INT4_SYM: 'int4_sym'>, 'group_size': 128} 
INFO:nncf:Statistics of the bitwidth distribution: 
+--------------+---------------------------+-----------------------------------+ 
| Num bits (N) | % all parameters (layers) |    % ratio-defining parameters    | 
|              |                           |             (layers)              | 
+==============+===========================+===================================+ 
| 8            | 9% (2 / 226)              | 0% (0 / 224)                      | 
+--------------+---------------------------+-----------------------------------+ 
| 4            | 91% (224 / 226)           | 100% (224 / 224)                  | 
+--------------+---------------------------+-----------------------------------+ 
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━ 100% 226/226 • 0:01:29 • 0:00:00
!python $benchmark_script -m $stateless_model_path -ic 512 -p "Tell me story about cats" -d $device.value
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: 
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.2.0+cpu) 
    Python 3.8.18 (you have 3.8.10) 
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) 
  Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated.Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino 
2024-03-05 13:55:27.540258: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-03-05 13:55:27.542166: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:55:27.578718: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-05 13:55:27.579116: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-05 13:55:28.229026: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. 
  warn("The installed version of bitsandbytes was compiled without GPU support." 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32 
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
  torch.utils._pytree._register_pytree_node( 
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: stable-zephyr-3b-stateless 
[ INFO ] OV Config={'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'} 
[ INFO ] OPENVINO_TORCH_BACKEND_DEVICE=CPU 
[ INFO ] Model path=stable-zephyr-3b-stateless/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version: 2024.1.0-14645-e6dc0865128 
Provided model does not contain state. It may lead to sub-optimal performance. Please reexport model with updated OpenVINO version >= 2023.3.0 calling the from_pretrained method with original model and export=True parameter 
Compiling the model to CPU ...
[ INFO ] From pretrained time: 3.15s 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[ INFO ] Num beams: 1, benchmarking iter nums(exclude warm-up): 0, prompt nums: 1 
[ INFO ] [warm-up] Input text: Tell me story about cats 
Setting pad_token_id to eos_token_id:0 for open-end generation.
[ INFO ] [warm-up] Input token size: 5, Output size: 336, Infer count: 512, Tokenization Time: 2.02ms, Detokenization Time: 0.51ms, Generation Time: 18.59s, Latency: 55.32 ms/token 
[ INFO ] [warm-up] First token latency: 990.01 ms/token, other tokens latency: 52.47 ms/token, len of tokens: 336 
[ INFO ] [warm-up] First infer latency: 989.00 ms/infer, other infers latency: 51.98 ms/infer, inference count: 336 
[ INFO ] [warm-up] Result MD5:['601aa0958ff0e0f9b844a9e6d186fbd9'] 
[ INFO ] [warm-up] Generated: Tell me story about cats and dogs. Once upon a time, in a small village, there lived a young girl named Lily. She had two pets, a cat named Mittens and a dog named Max. Mittens was a beautiful black cat with green eyes, and Max was a big lovable golden retriever with a wagging tail.
One sunny day, Lily decided to take her pets for a walk in the nearby forest. As they were walking, they heard a loud barking sound. Suddenly, a group of dogs appeared from the bushes, led by a big brown dog with a friendly smile.
Lily was scared at first, but Max quickly jumped in front of her and growled at the dogs. The big brown dog introduced himself as Rocky and explained that he and his friends were just out for a walk too. Lily and Rocky became fast friends, and they often went on walks together. Max and Rocky got along well too, and they would play together in the forest.
One day, while Lily was at school, Mittens and Max decided to explore the forest and stumbled upon a group of stray cats. The cats were hungry and scared, so Mittens and Max decided to help them by giving them some food. The cats were grateful and thanked Mittens and Max for their kindness. They even allowed Mittens to climb on their backs and enjoy the sun. 
From that day on, Mittens and Max became known as the village's cat and dog heroes. They were always there to help their furry friends in need.
And so, Lily learned that sometimes the best friends are the ones that share the same love for pets. <|endoftext|>

Using the model with Optimum Intel#

Running the model with the Optimum-Intel API requires the following steps: 1. Register a normalized config for the model. 2. Create an instance of the OVModelForCausalLM class using the from_pretrained method.

The model text-generation interface remains unchanged: the text-generation process starts by running the ov_model.generate method, passing text encoded by the tokenizer as input. This method returns a sequence of generated token ids that should be decoded using the tokenizer.

from optimum.intel.openvino import OVModelForCausalLM 
from transformers import AutoConfig 

ov_model = OVModelForCausalLM.from_pretrained( 
    stateful_model_path, 
    config=AutoConfig.from_pretrained(stateful_model_path, trust_remote_code=True), 
    device=device.value, 
)
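For example, a minimal generation round-trip with the loaded model could look like this (the prompt and the max_new_tokens value are arbitrary):

from transformers import AutoTokenizer 

tok = AutoTokenizer.from_pretrained(stateful_model_path) 
input_tokens = tok("Tell me story about cats", return_tensors="pt") 
answer_ids = ov_model.generate(**input_tokens, max_new_tokens=64) 
print(tok.batch_decode(answer_ids, skip_special_tokens=True)[0])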

Interactive chatbot demo#

Now our model is ready to use, so let's see it in action. We will use a Gradio interface to interact with the model. Enter a text message in the chat message box and click the Submit button to start a conversation. Several parameters can control the quality of the generated text (a toy numeric illustration of these sampling parameters follows the list below):

- **Temperature** is a parameter used to control the level of creativity in AI-generated text. By adjusting the temperature, you can influence the AI model's probability distribution, making the text more focused or diverse.
  Consider the following example: the AI model has to complete the sentence "The cat is ____." with the following next-token probabilities:
playing: 0.5 
sleeping: 0.25 
eating: 0.15 
driving: 0.05 
flying: 0.05 

  - **Low temperature** (e.g., 0.2): The AI model becomes more focused and deterministic, choosing tokens with the highest probability, such as "playing."
  - **Medium temperature** (e.g., 1.0): The AI model maintains a balance between creativity and focus, selecting tokens based on their probabilities without significant bias, such as "playing," "sleeping," or "eating."
  - **High temperature** (e.g., 2.0): The AI model becomes more adventurous, increasing the chances of selecting less likely tokens, such as "driving" and "flying."
- **Top-p**, also known as nucleus sampling, is a parameter that controls the range of tokens considered by the AI model based on their cumulative probability. By adjusting the top-p value, you can influence the AI model's token selection, making it more focused or diverse. Using the same cat example, consider the following top_p settings:

  - **Low top_p** (e.g., 0.5): The AI model considers only the tokens with the highest cumulative probability, such as "playing."

  - **Medium top_p** (e.g., 0.8): The AI model considers tokens with a higher cumulative probability, such as "playing," "sleeping," and "eating."

  - **High top_p** (e.g., 1.0): The AI model considers all tokens, including those with lower probabilities, such as "driving" and "flying."

- **Top-k** is another popular sampling strategy. In comparison with Top-P, which chooses from the smallest possible set of words whose cumulative probability exceeds the probability P, Top-K sampling filters the K most likely next words and redistributes the probability mass among only those K next words. In the cat example, with k=3, only "playing," "sleeping," and "eating" would be considered as possible next words.

- **Repetition Penalty** helps penalize tokens based on how frequently they occur in the text, including the input prompt. A token that has already appeared five times is penalized more heavily than a token that has appeared only once. A value of 1 means no penalty, and values larger than 1 discourage token repetition.
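To make these knobs concrete, here is a toy NumPy illustration (not part of the original notebook) applied to the probabilities from the cat example:

import numpy as np 

tokens = ["playing", "sleeping", "eating", "driving", "flying"] 
probs = np.array([0.5, 0.25, 0.15, 0.05, 0.05]) 

def with_temperature(p, t): 
    # Rescale log-probabilities by 1/t and renormalize 
    logits = np.log(p) / t 
    e = np.exp(logits - logits.max()) 
    return e / e.sum() 

def nucleus(p, top_p): 
    # Keep the smallest set of tokens whose cumulative probability reaches top_p 
    order = np.argsort(p)[::-1] 
    keep = np.cumsum(p[order]) - p[order] < top_p 
    return [tokens[i] for i in order[keep]] 

print(np.round(with_temperature(probs, 0.2), 3))  # low temperature: mass concentrates on "playing" 
print(np.round(with_temperature(probs, 2.0), 3))  # high temperature: distribution flattens 
print(nucleus(probs, 0.8))  # medium top_p: ['playing', 'sleeping', 'eating'] 
print(tokens[:3])  # top-k with k=3: the three most likely words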

These can be changed in the Advanced Options section.

import torch 
from threading import Event, Thread 
from uuid import uuid4 
from typing import List, Tuple 
import gradio as gr 
from transformers import ( 
    AutoTokenizer, 
    StoppingCriteria, 
    StoppingCriteriaList, 
    TextIteratorStreamer, 
) 

model_name = "stable-zephyr-3b" 

tok = AutoTokenizer.from_pretrained(stateful_model_path) 

DEFAULT_SYSTEM_PROMPT = """\ 
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\ 
""" 

model_configuration = { 
    "start_message": f"<|system|>\n {DEFAULT_SYSTEM_PROMPT }<|endoftext|>", 
    "history_template": "<|user|>\n{user}<|endoftext|><|assistant|>\n{assistant}<|endoftext|>", 
    "current_message_template": "<|user|>\n{user}<|endoftext|><|assistant|>\n{assistant}", 
} 
history_template = model_configuration["history_template"] 
current_message_template = model_configuration["current_message_template"] 
start_message = model_configuration["start_message"] 
stop_tokens = model_configuration.get("stop_tokens") 
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {}) 

examples = [ 
    ["Hello there! How are you doing?"], 
    ["What is OpenVINO?"], 
    ["Who are you?"], 
    ["Can you explain to me briefly what is Python programming language?"], 
    ["Explain the plot of Cinderella in a sentence."], 
    ["What are some common mistakes to avoid when writing code?"], 
    ["Write a 100-word blog post on “Benefits of Artificial Intelligence and OpenVINO“"], 
] 

max_new_tokens = 256 

class StopOnTokens(StoppingCriteria): 
    def __init__(self, token_ids): 
        self.token_ids = token_ids 

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: 
        for stop_id in self.token_ids: 
            if input_ids[0][-1] == stop_id: 
                return True 
        return False 

if stop_tokens is not None: 
    if isinstance(stop_tokens[0], str): 
        stop_tokens = tok.convert_tokens_to_ids(stop_tokens) 
    stop_tokens = [StopOnTokens(stop_tokens)] 

def default_partial_text_processor(partial_text: str, new_text: str): 
    """ 
    helper for updating partially generated answer, used by default 

    Params: 
        partial_text: text buffer for storing previously generated text 
        new_text: text update for the current step 
    Returns: 
        updated text string 

    """ 
    partial_text += new_text 
    return partial_text 

text_processor = model_configuration.get("partial_text_processor", default_partial_text_processor) 

def convert_history_to_text(history: List[Tuple[str, str]]): 
    """ 
    function for converting history, stored as a list of (user, assistant) message pairs, to a string according to the model's expected conversation template 
    Params: 
        history: dialogue history 
    Returns: 
        history in text format 
    """ 
    text = start_message + "".join(["".join([history_template.format(num=round, user=item[0], assistant=item[1])]) for round, item in enumerate(history[:-1])]) 
    text += "".join( 
        [ 
            "".join( 
                [ 
                    current_message_template.format( 
                        num=len(history) + 1, 
                        user=history[-1][0], 
                        assistant=history[-1][1], 
                    ) 
                ] 
            ) 
        ] 
    ) 
    return text 
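As an illustrative check (not in the original notebook), a two-turn history renders the completed pair with history_template and the pending pair with current_message_template:

example_history = [ 
    ["Hi there!", "Hello! How can I help you?"], 
    ["What is OpenVINO?", ""],  # the last assistant turn is empty and will be generated 
] 
print(convert_history_to_text(example_history))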

def user(message, history): 
    """ 
    callback function for updating user messages in interface on submit button click 

    Params: 
        message: current message 
        history: conversation history 
    Returns:
        None 
    """ 
    # Append the user's message to the conversation history 
    return "", history + [[message, ""]] 

def bot(history, temperature, top_p, top_k, repetition_penalty, conversation_id): 
    """ 
    callback function for running chatbot on submit button click 

    Params: 
        history: conversation history 
        temperature: parameter to control the level of creativity in AI-generated text.
                     By adjusting the `temperature`, you can influence the AI model's probability distribution, making the text more focused or diverse. 
        top_p: parameter to control the range of tokens considered by the AI model based on their cumulative probability. 
        top_k: parameter to control the number of highest-probability tokens considered by the AI model. 
        repetition_penalty: parameter to penalize tokens based on how frequently they occur in the text. 
        conversation_id: unique conversation identifier.
    """ 

    # Construct the input message string for the model by concatenating the current system message and conversation history 
    messages = convert_history_to_text(history) 

    # Tokenize the message string 
    input_ids = tok(messages, return_tensors="pt", **tokenizer_kwargs).input_ids 
    if input_ids.shape[1] > 2000: 
        history = [history[-1]] 
        messages = convert_history_to_text(history) 
        input_ids = tok(messages, return_tensors="pt", **tokenizer_kwargs).input_ids 
    streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True) 
    generate_kwargs = dict( 
        input_ids=input_ids, 
        max_new_tokens=max_new_tokens, 
        temperature=temperature, 
        do_sample=temperature > 0.0, 
        top_p=top_p, 
        top_k=top_k, 
        repetition_penalty=repetition_penalty, 
        streamer=streamer, 
    ) 
    if stop_tokens is not None: 
        generate_kwargs["stopping_criteria"] = StoppingCriteriaList(stop_tokens) 

    stream_complete = Event() 

    def generate_and_signal_complete(): 
        """ 
        generation function for single thread 
        """ 
        global start_time 
        ov_model.generate(**generate_kwargs) 
        stream_complete.set() 

    t1 = Thread(target=generate_and_signal_complete) 
    t1.start() 

    # Initialize an empty string to store the generated text 
    partial_text = "" 
    for new_text in streamer: 
        partial_text = text_processor(partial_text, new_text) 
        history[-1][1] = partial_text 
        yield history 

def get_uuid(): 
    """ 
    universally unique identifier for thread 
    """ 
    return str(uuid4()) 

with gr.Blocks( 
    theme=gr.themes.Soft(), 
    css=".disclaimer {font-variant-caps: all-small-caps;}", 
) as demo: 
    conversation_id = gr.State(get_uuid) 
    gr.Markdown(f"""<h1><center>OpenVINO {model_name} Chatbot</center></h1>""") 
    chatbot = gr.Chatbot(height=500) 
    with gr.Row(): 
        with gr.Column(): 
            msg = gr.Textbox( 
                label="Chat Message Box", 
                placeholder="Chat Message Box", 
                show_label=False, 
                container=False, 
            ) 
        with gr.Column(): 
            with gr.Row(): 
                submit = gr.Button("Submit") 
                stop = gr.Button("Stop") 
                clear = gr.Button("Clear") 
    with gr.Row(): 
        with gr.Accordion("Advanced Options:", open=False): 
            with gr.Row(): 
                with gr.Column(): 
                    with gr.Row(): 
                        temperature = gr.Slider( 
                            label="Temperature", 
                            value=0.1, 
                            minimum=0.0, 
                            maximum=1.0, 
                            step=0.1, 
                            interactive=True, 
                            info="Higher values produce more diverse outputs", 
                        ) 
                with gr.Column(): 
                    with gr.Row(): 
                        top_p = gr.Slider( 
                            label="Top-p (nucleus sampling)", 
                            value=1.0, 
                            minimum=0.0, 
                            maximum=1, 
                            step=0.01, 
                            interactive=True, 
                            info=( 
                                "Sample from the smallest possible set of tokens whose cumulative probability " 
                                "exceeds top_p.Set to 1 to disable and sample from all tokens."                             ), 
                        ) 
                with gr.Column(): 
                    with gr.Row(): 
                        top_k = gr.Slider( 
                            label="Top-k", 
                            value=50, 
                            minimum=0.0, 
                            maximum=200, 
                            step=1, 
                            interactive=True, 
                            info="Sample from a shortlist of top-k tokens — 0 to disable and sample from all tokens.", 
                        ) 
                with gr.Column(): 
                    with gr.Row(): 
                        repetition_penalty = gr.Slider( 
                            label="Repetition Penalty", 
                            value=1.1, 
                            minimum=1.0, 
                            maximum=2.0, 
                            step=0.1, 
                            interactive=True, 
                            info="Penalize repetition — 1.0 to disable.", 
                        ) 
    gr.Examples(examples, inputs=msg, label="Click on any example and press the 'Submit' button") 

    submit_event = msg.submit( 
        fn=user, 
        inputs=[msg, chatbot], 
        outputs=[msg, chatbot], 
        queue=False, 
    ).then( 
        fn=bot, 
        inputs=[ 
            chatbot, 
            temperature, 
            top_p, 
            top_k, 
            repetition_penalty, 
            conversation_id, 
        ], 
        outputs=chatbot, 
        queue=True, 
    ) 
    submit_click_event = submit.click( 
        fn=user, 
        inputs=[msg, chatbot], 
        outputs=[msg, chatbot], 
        queue=False, 
    ).then( 
        fn=bot, 
        inputs=[ 
            chatbot, 
            temperature, 
            top_p, 
            top_k, 
            repetition_penalty, 
            conversation_id, 
        ], 
        outputs=chatbot, 
        queue=True, 
    ) 
    stop.click( 
        fn=None, 
        inputs=None, 
        outputs=None, 
        cancels=[submit_event, submit_click_event], 
        queue=False, 
    ) 
    clear.click(lambda: None, None, chatbot, queue=False) 

demo.queue(max_size=2) 
# If launching remotely, specify server_name and server_port 
# demo.launch(server_name='your server name', server_port='server port in int') 
# If you have any issues launching on your platform, you can pass share=True to the launch method: 
# demo.launch(share=True) 
# This creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/ 
demo.launch(share=True)