LLM Instruction-Following Pipeline with OpenVINO#

This Jupyter notebook can be launched only after a local installation.

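LLM stands for “Large Language Model”, a type of artificial intelligence model designed to understand and generate human-like text based on the input it receives. LLMs are trained on large text datasets to learn patterns, grammar, and semantic relationships, which allows them to generate coherent, contextually relevant responses. One core capability of LLMs is following natural language instructions. Instruction-following models generate text in response to prompts and are often used for tasks such as writing assistance, chatbots, and content generation.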

In this tutorial, we consider how to run an instruction-following text generation pipeline using popular LLMs and OpenVINO. We use pre-trained models from the Hugging Face Transformers library. To simplify the user experience, the Hugging Face Optimum Intel library converts the models to OpenVINO™ IR format, and the OpenVINO Generate API builds the instruction-following inference pipeline.

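The tutorial consists of the following steps:

Install prerequisites
Download and convert the model from a public source using the OpenVINO integration with Hugging Face Optimum
Compress model weights to INT8 and INT4 with OpenVINO NNCF
Create an instruction-following inference pipeline with the Generate API
Run the instruction-following pipeline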

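Table of contents:

Prerequisites
Select model for inference
Download and convert model to OpenVINO IR via Optimum Intel CLI
Compress model weights
- Weights Compression using Optimum Intel CLI
- Weights Compression using NNCF
Select device for inference and model variant
Create an instruction-following inference pipeline
Run instruction-following pipeline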

Prerequisites#

%pip uninstall -q -y optimum optimum-intel 
%pip install -q "openvino-genai>=2024.2" 
%pip install -q "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" onnx "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "gradio>=4.19" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu

WARNING: Skipping optimum as it is not installed.
WARNING: Skipping optimum-intel as it is not installed.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

Select model for inference#

The tutorial supports different models; you can select one of the provided options to compare the quality of open-source LLM solutions.

Note: Conversion of some models can require additional actions from the user and at least 64GB of RAM.

The available options are:

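tiny-llama-1b-chat - This is the chat model fine-tuned on top of TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It adopts the same architecture and tokenizer as Llama 2, which means TinyLlama can be plugged into many open-source projects built on Llama. With only 1.1B parameters, it is also compact enough for applications with a restricted computation and memory footprint. More details about the model can be found in the model card.
phi-2 - Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source consisting of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters. More details about the model can be found in the model card.
dolly-v2-3b - Dolly 2.0 is an instruction-following large language model trained on the Databricks machine-learning platform and licensed for commercial use. It is based on Pythia and fine-tuned on about 15,000 instruction/response records generated by Databricks employees across capability domains including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Dolly 2.0 works by processing natural language instructions and generating responses that follow them. It can be used for a wide range of applications, including closed question answering, summarization, and generation. More details about the model can be found in the model card.
red-pajama-3b-instruct - A 2.8B-parameter pretrained language model based on the GPT-NEOX architecture. The model was fine-tuned for few-shot applications on the GPT-JT data, excluding tasks that overlap with the HELM core scenarios. More details about the model can be found in the model card.
mistral-7b - The Mistral-7B-v0.2 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. More details about the model can be found in the model card, the paper, and the release blog post.
llama-3-8b-instruct - Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. More details about the model can be found in the Meta blog post, the model website, and the model card.

Note: Running this model in the demo requires accepting its license agreement. You must be a registered user on the Hugging Face Hub. Visit the Hugging Face model card, carefully read the terms of usage, and click the accept button. You will need an access token to run the code below. For more information on access tokens, refer to this section of the documentation. You can log in to the Hugging Face Hub in the notebook environment with the following code: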

## Log in to the Hugging Face Hub to get access to pretrained models

from huggingface_hub import notebook_login, whoami 

try: 
    whoami() 
    print('Authorization token already provided') 
except OSError: 
    notebook_login()

from pathlib import Path 
import requests 

# Fetch the `notebook_utils` module
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
with open("notebook_utils.py", "w") as f:
    f.write(r.text)
from notebook_utils import download_file

# Download the model configuration only if it is not already present
if not Path("./config.py").exists():
    download_file(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/llm-question-answering/config.py")
from config import SUPPORTED_LLM_MODELS 
import ipywidgets as widgets

model_ids = list(SUPPORTED_LLM_MODELS) 

model_id = widgets.Dropdown( 
    options=model_ids, 
    value=model_ids[1], 
    description="Model:", 
    disabled=False, 
) 

model_id

Dropdown(description='Model:', index=1, options=('tiny-llama-1b', 'phi-2', 'dolly-v2-3b', 'red-pajama-instruct…

model_configuration = SUPPORTED_LLM_MODELS[model_id.value] 
print(f"Selected model {model_id.value}")

Selected model dolly-v2-3b

Download and convert model to OpenVINO IR via Optimum Intel CLI#

Models are downloaded from the Hugging Face Hub. To export them to the OpenVINO Intermediate Representation (IR) format, use the optimum-cli interface.

The Optimum CLI interface for converting models supports export to OpenVINO (available starting from optimum-intel version 1.12). The general command format is:

optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>

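The --model argument is a model ID from the Hugging Face Hub or a local directory containing the model (saved using the .save_pretrained method). The --task argument is one of the supported tasks that the exported model should solve; for LLMs it is text-generation-with-past. If model initialization requires remote code, pass the --trust-remote-code flag as well. The full list of supported arguments is available via --help. For more details and usage examples, check the optimum documentation.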
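The command format above can also be assembled programmatically, for example when running outside a notebook. The following is a minimal sketch, not part of the notebook: `build_export_command` is a hypothetical helper name, the model ID and output directory are illustrative, and only flags mentioned in this tutorial (`--model`, `--task`, `--weight-format`, `--trust-remote-code`) are used.

```python
import shlex
from typing import List, Optional


def build_export_command(
    model: str,
    output_dir: str,
    task: str = "text-generation-with-past",
    weight_format: Optional[str] = None,
    trust_remote_code: bool = False,
) -> List[str]:
    """Assemble an `optimum-cli export openvino` command as an argument list."""
    command = ["optimum-cli", "export", "openvino", "--model", model, "--task", task]
    if weight_format is not None:
        command += ["--weight-format", weight_format]
    if trust_remote_code:
        command.append("--trust-remote-code")
    # The output directory is the final positional argument.
    command.append(output_dir)
    return command


# Illustrative values; substitute the model and directory you actually use.
command = build_export_command("databricks/dolly-v2-3b", "dolly-v2-3b/FP16", weight_format="fp16")
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting issues; it can be passed to `subprocess.run(command, check=True)` to perform the export.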

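Compress model weights#

The weight compression algorithm compresses the weights of a model. It can be used to optimize the footprint and performance of large models in which the weights are large relative to the activations, such as Large Language Models (LLMs). Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.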
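To make the quality trade-off concrete, here is a simplified sketch of group-wise symmetric round-to-nearest quantization in pure Python. It is an illustration only and not NNCF's actual algorithm; the function names and the synthetic "weights" are assumptions made for this example.

```python
import math
from typing import List


def quantize_dequantize(weights: List[float], bits: int, group_size: int) -> List[float]:
    """Symmetric round-to-nearest quantization per group, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start : start + group_size]
        # One floating-point scale is shared by every weight in the group;
        # fall back to 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(w) for w in group) / qmax or 1.0
        for w in group:
            q = max(-qmax - 1, min(qmax, round(w / scale)))
            restored.append(q * scale)
    return restored


def mean_abs_error(original: List[float], restored: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Deterministic stand-in for a weight tensor (an assumption for illustration).
weights = [math.sin(i) for i in range(256)]
err_int8 = mean_abs_error(weights, quantize_dequantize(weights, bits=8, group_size=128))
err_int4 = mean_abs_error(weights, quantize_dequantize(weights, bits=4, group_size=128))
print(f"INT8 mean abs error: {err_int8:.5f}")
print(f"INT4 mean abs error: {err_int4:.5f}")
```

With fewer bits there are fewer representable levels per group, so the reconstruction error grows. This is the source of the quality drop, and smaller group sizes are one way to limit it.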

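Weights Compression using Optimum Intel CLI#

Optimum Intel supports weight compression via NNCF. For 8-bit compression, pass --weight-format int8 to the optimum-cli command line. For 4-bit compression, pass --weight-format int4 together with additional options that set the remaining compression parameters. An example of this approach is available in the llm-chatbot notebook.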

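Weights Compression using NNCF#

You can also perform weight compression for OpenVINO models using NNCF directly. The nncf.compress_weights function accepts an OpenVINO model instance and compresses the weights of its Linear and Embedding layers. This notebook considers both the INT4 and INT8 compression variants.

Note: This tutorial involves conversion of the model for the FP16 and INT4/INT8 weight compression scenarios. The first run may be memory-intensive and time-consuming. You can control the compression precision manually below.

Note: INT4/INT8 compressed models may show no speedup on dGPU.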

from IPython.display import display, Markdown 

prepare_int4_model = widgets.Checkbox( 
    value=True, 
    description="Prepare INT4 model", 
    disabled=False, 
) 
prepare_int8_model = widgets.Checkbox( 
    value=False, 
    description="Prepare INT8 model", 
    disabled=False, 
) 
prepare_fp16_model = widgets.Checkbox( 
    value=False, 
    description="Prepare FP16 model", 
    disabled=False, 
) 

display(prepare_int4_model) 
display(prepare_int8_model) 
display(prepare_fp16_model)

Checkbox(value=True, description='Prepare INT4 model')

Checkbox(value=False, description='Prepare INT8 model')

Checkbox(value=False, description='Prepare FP16 model')

from pathlib import Path 
import logging 
import openvino as ov 
import nncf 

nncf.set_log_level(logging.ERROR) 

pt_model_id = model_configuration["model_id"] 
fp16_model_dir = Path(model_id.value) / "FP16" 
int8_model_dir = Path(model_id.value) / "INT8_compressed_weights" 
int4_model_dir = Path(model_id.value) / "INT4_compressed_weights" 

core = ov.Core() 

def convert_to_fp16(): 
    if (fp16_model_dir / "openvino_model.xml").exists(): 
        return 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format fp16".format(pt_model_id) 
    export_command = export_command_base + " " + str(fp16_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

def convert_to_int8(): 
    if (int8_model_dir / "openvino_model.xml").exists(): 
        return 
    int8_model_dir.mkdir(parents=True, exist_ok=True) 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format int8".format(pt_model_id) 
    export_command = export_command_base + " " + str(int8_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

def convert_to_int4(): 
    compression_configs = { 
        "mistral-7b": { 
            "sym": True, 
            "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "red-pajama-3b-instruct": { 
            "sym": False, 
            "group_size": 128, 
            "ratio": 0.5, 
        }, 
        "dolly-v2-3b": {
            "sym": False,
            "group_size": 32,
            "ratio": 0.5,
        },
        "llama-3-8b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0}, 
        "default": { 
            "sym": False, 
            "group_size": 128, 
            "ratio": 0.8, 
        }, 
    } 

    model_compression_params = compression_configs.get(model_id.value, compression_configs["default"]) 
    if (int4_model_dir / "openvino_model.xml").exists(): 
        return 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format int4".format(pt_model_id) 
    int4_compression_args = " --group-size {} --ratio {}".format(model_compression_params["group_size"], model_compression_params["ratio"]) 
    if model_compression_params["sym"]: 
        int4_compression_args += " --sym" 
    export_command_base += int4_compression_args 
    export_command = export_command_base + " " + str(int4_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

if prepare_fp16_model.value: 
    convert_to_fp16() 
if prepare_int8_model.value: 
    convert_to_int8() 
if prepare_int4_model.value: 
    convert_to_int4()

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino

Let's compare the model size for the different compression types.

fp16_weights = fp16_model_dir / "openvino_model.bin" 
int8_weights = int8_model_dir / "openvino_model.bin" 
int4_weights = int4_model_dir / "openvino_model.bin" 

if fp16_weights.exists(): 
    print(f"Size of FP16 model is {fp16_weights.stat().st_size / 1024 / 1024:.2f} MB") 
for precision, compressed_weights in zip([8, 4], [int8_weights, int4_weights]): 
    if compressed_weights.exists(): 
        print(f"Size of model with INT{precision} compressed weights is {compressed_weights.stat().st_size / 1024 / 1024:.2f} MB") 
    if compressed_weights.exists() and fp16_weights.exists(): 
        print(f"Compression rate for INT{precision} model: {fp16_weights.stat().st_size / compressed_weights.stat().st_size:.3f}")

Size of FP16 model is 5297.21 MB 
Size of model with INT8 compressed weights is 2656.29 MB 
Compression rate for INT8 model: 1.994 
Size of model with INT4 compressed weights is 2154.54 MB 
Compression rate for INT4 model: 2.459
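The INT4 compression rate is well below 4x because the model is not stored purely in 4 bits. A rough back-of-the-envelope estimate is sketched below using the dolly-v2-3b entry from `compression_configs` (asymmetric, group size 32, ratio 0.5). It assumes that `ratio` is the fraction of weights stored in INT4 with the rest in INT8, and that each INT4 group carries one FP16 scale and one 4-bit zero point. These are simplifying assumptions for illustration, not a description of NNCF internals.

```python
def estimated_bits_per_weight(ratio: float, group_size: int, symmetric: bool) -> float:
    """Rough average storage cost per weight for a mixed INT4/INT8 model."""
    scale_bits = 16  # assumption: one FP16 scale per group
    zero_point_bits = 0 if symmetric else 4  # assumption: one 4-bit zero point per group
    int4_bits = 4 + (scale_bits + zero_point_bits) / group_size
    int8_bits = 8  # per-channel INT8 overhead is ignored as negligible
    return ratio * int4_bits + (1 - ratio) * int8_bits


# dolly-v2-3b settings: {"sym": False, "group_size": 32, "ratio": 0.5}
bits = estimated_bits_per_weight(ratio=0.5, group_size=32, symmetric=False)
print(f"Estimated bits per weight: {bits:.4f}")
print(f"Estimated compression rate vs FP16: {16 / bits:.3f}")
```

Under these assumptions the estimate is about 6.3 bits per weight, or roughly 2.5x relative to FP16, which is in the same range as the measured rate of 2.459.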

Select device for inference and model variant#

Note: INT4/INT8 compressed models may show no speedup on dGPU.

core = ov.Core() 

support_devices = core.available_devices 
if "NPU" in support_devices: 
    support_devices.remove("NPU") 

device = widgets.Dropdown( 
    options=support_devices + ["AUTO"], 
    value="CPU", 
    description="Device:", 
    disabled=False, 
) 

device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

available_models = [] 
if int4_model_dir.exists(): 
    available_models.append("INT4") 
if int8_model_dir.exists(): 
    available_models.append("INT8") 
if fp16_model_dir.exists(): 
    available_models.append("FP16") 

model_to_run = widgets.Dropdown( 
    options=available_models, 
    value=available_models[0], 
    description="Model to run:", 
    disabled=False, 
) 

model_to_run

Dropdown(description='Model to run:', options=('INT4', 'INT8', 'FP16'), value='INT4')

from transformers import AutoTokenizer 
from openvino_tokenizers import convert_tokenizer 

if model_to_run.value == "INT4": 
    model_dir = int4_model_dir 
elif model_to_run.value == "INT8": 
    model_dir = int8_model_dir 
else: 
    model_dir = fp16_model_dir 
print(f"Loading model from {model_dir}") 

# Optionally convert the tokenizer if a cached model without tokenizer files is used
if not (model_dir / "openvino_tokenizer.xml").exists() or not (model_dir / "openvino_detokenizer.xml").exists():
    hf_tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
    ov.save_model(ov_tokenizer, model_dir / "openvino_tokenizer.xml")
    ov.save_model(ov_detokenizer, model_dir / "openvino_detokenizer.xml")

Loading model from dolly-v2-3b/INT8_compressed_weights

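Create an instruction-following inference pipeline#

The run_generation function accepts user-provided text input, tokenizes it, and runs the generation process. Text generation is an iterative process in which each next token depends on the previously generated ones, until the maximum number of tokens or a stop-generation condition is reached.

The diagram below illustrates how the instruction-following pipeline works.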

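As you can see, on the first iteration the user provides an instruction. The tokenizer converts the instruction to token IDs, and the prepared input is passed to the model. The model generates scores for all tokens in logits format. How the next token is selected from these scores depends on the chosen decoding method. You can find more information about the most popular decoding methods in this blog.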
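The iterative loop described above can be sketched with a toy example. Everything here is hypothetical: the "model" is a small lookup table that scores the next token from the last token only, and tokens are plain strings, whereas the real pipeline works on token IDs produced by a tokenizer and logits produced by the LLM.

```python
from typing import Callable, Dict, List

EOS = "<eos>"

# Hypothetical toy "model": next-token scores depend only on the last token.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "The": {"cat": 2.0, "dog": 1.0},
    "cat": {"is": 3.0, EOS: 0.5},
    "dog": {"is": 3.0, EOS: 0.5},
    "is": {"sleeping": 1.5, "playing": 2.5},
    "playing": {EOS: 4.0},
    "sleeping": {EOS: 4.0},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Return a score (logit) for each candidate next token."""
    return TOY_SCORES[tokens[-1]]


def greedy(logits: Dict[str, float]) -> str:
    """Greedy decoding: pick the highest-scoring token."""
    return max(logits, key=logits.get)


def generate(prompt: List[str], decode: Callable[[Dict[str, float]], str], max_new_tokens: int) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each step sees the prompt plus everything generated so far.
        next_token = decode(toy_model(tokens))
        if next_token == EOS:  # stop-generation condition
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["The"], greedy, max_new_tokens=10)))
```

Swapping `greedy` for a sampling function changes the decoding method without touching the loop itself.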

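To simplify the user experience, we use the OpenVINO Generate API. First, we create a pipeline with LLMPipeline, the main object used for decoding. It can be constructed directly from the folder with the converted model and automatically loads the main model, tokenizer, detokenizer, and default generation configuration. Next, we set the decoding parameters. You can get the default config with get_generation_config(), set parameters, and apply the updated version with set_generation_config(config), or pass the config directly to generate(). It is also possible to specify the needed options as keyword arguments of generate(), as shown below. Then we run the generate method and get the output in text format. There is no need to encode the input prompt according to the model's expected template or to write post-processing code for the logits decoder; LLMPipeline handles both.

To obtain intermediate generation results without waiting until generation is finished, we write a class iterator based on the StreamerBase class of openvino_genai.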

from openvino_genai import LLMPipeline 

pipe = LLMPipeline(model_dir.as_posix(), device.value) 
print(pipe.generate("The Sun is yellow bacause", temperature=1.2, top_k=4, do_sample=True, max_new_tokens=150))

 of the presence of chlorophyll in its leaves. Chlorophyll absorbs all visible sunlight and this causes it to turn from a green to yellow colour. The Sun is yellow bacause of the presence of chlorophyll in its leaves. Chlorophyll absorbs all visible sunlight and this causes it to turn from a green to yellow colour. The yellow colour of the Sun is the colour we perceive as the colour of the sun. It also causes us to perceive the sun as yellow. This property is called the yellow colouration of the Sun and it is caused by the presence of chlorophyll in the leaves of plants. Chlorophyll is also responsible for the green colour of plants

Several parameters control the quality of the generated text:

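Temperature is a parameter that controls the level of creativity in AI-generated text. Adjusting the temperature changes the model's probability distribution, making the text more focused or more diverse.

Consider the following example: the AI model has to complete the sentence “The cat is ____.” with the following token probabilities:

playing: 0.5
sleeping: 0.25
eating: 0.15
driving: 0.05
flying: 0.05

- Low temperature (e.g., 0.2): The AI model becomes more focused and deterministic, choosing tokens with the highest probability, such as “playing”.
- Medium temperature (e.g., 1.0): The AI model maintains a balance between creativity and focus, selecting tokens according to their probabilities without significant bias, such as “playing”, “sleeping”, or “eating”.
- High temperature (e.g., 2.0): The AI model becomes more adventurous, increasing the chances of selecting less likely tokens, such as “driving” and “flying”.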
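Top-p, also known as nucleus sampling, is a parameter that controls the range of tokens considered by the AI model based on their cumulative probability. Adjusting the top-p value influences the model's token selection, making it more focused or more diverse. Using the same cat example, consider the following top_p settings:
- Low top_p (e.g., 0.5): The AI model considers only the smallest set of most probable tokens whose cumulative probability reaches 0.5, here only “playing”.
- Medium top_p (e.g., 0.8): The AI model considers tokens up to a higher cumulative probability, here “playing”, “sleeping”, and “eating”.
- High top_p (e.g., 1.0): The AI model considers all tokens, including the less probable ones such as “driving” and “flying”.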
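Top-k is another popular sampling strategy. In comparison with Top-p, which chooses from the smallest possible set of words whose cumulative probability exceeds the probability P, Top-k sampling keeps only the K most likely next words and redistributes the probability mass among those K words. In our cat example, with k=3 only “playing”, “sleeping”, and “eating” are considered as the next word.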
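The effect of the three parameters on the cat example can be reproduced with a short pure-Python sketch. The helper names are assumptions made for this illustration; temperature is applied as p^(1/T) followed by renormalization, which is equivalent to dividing the logits by T before the softmax.

```python
import math
from typing import Dict

CAT_PROBS = {"playing": 0.5, "sleeping": 0.25, "eating": 0.15, "driving": 0.05, "flying": 0.05}


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def apply_temperature(probs: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Rescale log-probabilities by 1/temperature and renormalize."""
    return renormalize({tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()})


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and redistribute the probability mass."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return renormalize(kept)


for t in (0.2, 1.0, 2.0):
    dist = apply_temperature(CAT_PROBS, t)
    print(f"temperature={t}:", {tok: round(p, 3) for tok, p in dist.items()})
print("top_k=3:", list(top_k_filter(CAT_PROBS, 3)))
print("top_p=0.5:", list(top_p_filter(CAT_PROBS, 0.5)))
print("top_p=0.8:", list(top_p_filter(CAT_PROBS, 0.8)))
```

At temperature 0.2 almost all of the probability mass moves to “playing”, while at 2.0 the unlikely tokens gain probability. The filters then restrict which tokens remain candidates before one is sampled.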

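The generation cycle repeats until the end-of-sequence token is reached, or it can be interrupted once the maximum number of tokens has been generated. As mentioned above, the streaming API makes it possible to print the tokens generated so far without waiting for the whole generation to finish: each new token is added to an output queue and printed as soon as it is ready.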

Setup imports#

from threading import Thread 
from time import perf_counter 
from typing import List 
import gradio as gr 
import numpy as np 
from openvino_genai import StreamerBase 
from queue import Queue 
import re

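Prepare text streamer to get results at runtime#

We load the detokenizer and use it to convert token IDs to string output. Print-ready text is collected in a queue and handed out when requested. This also helps with estimating performance.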

core = ov.Core() 

detokinizer_dir = Path(model_dir, "openvino_detokenizer.xml") 

class TextIteratorStreamer(StreamerBase): 
    def __init__(self, tokenizer): 
        super().__init__() 
        self.tokenizer = tokenizer 
        self.compiled_detokenizer = core.compile_model(detokinizer_dir.as_posix()) 
        self.text_queue = Queue() 
        self.stop_signal = None 

    def __iter__(self): 
        return self 

    def __next__(self): 
        value = self.text_queue.get() 
        if value == self.stop_signal: 
            raise StopIteration() 
        else: 
            return value 

    def put(self, token_id): 
        openvino_output = self.compiled_detokenizer(np.array([[token_id]], dtype=int)) 
        text = str(openvino_output["string_output"][0]) 
        # Remove labels/special symbols
        text = text.lstrip("!") 
        text = re.sub("<.*>", "", text) 
        self.text_queue.put(text) 

    def end(self): 
        self.text_queue.put(self.stop_signal)
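The queue-and-sentinel pattern used by the streamer can be tried in isolation, without a model. In this sketch `QueueIterator` and `fake_generate` are hypothetical stand-ins: the first mirrors the `put`/`end`/`__next__` structure of the class above, and the second plays the role of a generation call running in a worker thread.

```python
from queue import Queue
from threading import Thread
from typing import Iterator, List


class QueueIterator:
    """Iterator fed by a producer thread; `None` marks the end of the stream."""

    def __init__(self) -> None:
        self.queue: Queue = Queue()
        self.stop_signal = None

    def put(self, text: str) -> None:
        self.queue.put(text)

    def end(self) -> None:
        self.queue.put(self.stop_signal)

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        value = self.queue.get()  # blocks until the producer provides something
        if value is self.stop_signal:
            raise StopIteration
        return value


def fake_generate(pieces: List[str], streamer: QueueIterator) -> None:
    """Hypothetical stand-in for a generation call that emits text piece by piece."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


streamer = QueueIterator()
worker = Thread(target=fake_generate, args=(["Hello", ",", " world"], streamer))
worker.start()
collected = "".join(streamer)  # consumes pieces as they arrive
worker.join()
print(collected)
```

Because `Queue.get()` blocks, the consumer simply waits for the next piece, and the sentinel pushed by `end()` terminates the `for` loop cleanly.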

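Main generation function#

As mentioned above, the run_generation function is the entry point for starting generation. It takes the provided input instruction as a parameter and returns the model response.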

def run_generation( 
    user_text: str, 
    top_p: float, 
    temperature: float, 
    top_k: int, 
    max_new_tokens: int, 
    perf_text: str, 
): 
    """ 
    Text generation function 

    Parameters: 
        user_text (str): User-provided instruction for a generation. 
        top_p (float): Nucleus sampling. If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for a generation. 
        temperature (float): The value used to modulate the logits distribution.
        top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering. 
        max_new_tokens (int): Maximum length of generated sequence. 
        perf_text (str): Content of text field for printing performance results.
    Returns: 
        model_output (str) - model-generated text 
        perf_text (str) - updated perf text field content
    """ 

    # Configure the decoding stage
    config = pipe.get_generation_config() 
    config.temperature = temperature 
    if top_k > 0: 
        config.top_k = top_k 
    config.top_p = top_p 
    config.do_sample = True 
    config.max_new_tokens = max_new_tokens 

    # Start generation on a separate thread so that we don't block the UI.
    # The text is pulled from the streamer in the main thread.

    streamer = TextIteratorStreamer(pipe.get_tokenizer()) 

    t = Thread(target=pipe.generate, args=(user_text, config, streamer)) 
    t.start()
 
    model_output = "" 
    per_token_time = [] 
    num_tokens = 0 
    start = perf_counter() 
    for new_text in streamer: 
        current_time = perf_counter() - start 
        model_output += new_text 
        perf_text, num_tokens = estimate_latency(current_time, perf_text, per_token_time, num_tokens) 
        yield model_output, perf_text 
        start = perf_counter() 
    return model_output, perf_text

Helpers for application#

To build an interactive user interface, we use the Gradio library. The code below provides helper functions used for communication with UI elements.

def estimate_latency( 
    current_time: float, 
    current_perf_text: str, 
    per_token_time: List[float], 
    num_tokens: int, 
): 
    """ 
    Helper function for performance estimation 

    Parameters: 
        current_time (float): This step time in seconds. 
        current_perf_text (str): Current content of performance UI field. 
        per_token_time (List[float]): history of performance from previous steps. 
        num_tokens (int): Total number of generated tokens. 
    Returns: 
        update for performance text field 
        update for a total number of tokens 
    """ 
    num_tokens += 1 
    per_token_time.append(1 / current_time) 
    if len(per_token_time) > 10 and len(per_token_time) % 4 == 0: 
        current_bucket = per_token_time[:-10] 
        return ( 
            f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}", 
            num_tokens, 
        ) 
    return current_perf_text, num_tokens 

def reset_textbox(instruction: str, response: str, perf: str): 
    """ 
    Helper function for resetting content of all text fields 

    Parameters: 
        instruction (str): Content of user instruction field. 
        response (str): Content of model response field. 
        perf (str): Content of performance info field.

    Returns: 
        empty string for each placeholder 
    """ 

    return "", "", ""
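One detail worth keeping in mind when reading the reported number: `estimate_latency` stores a rate (`1 / current_time`) per token and averages those rates. The average of per-token rates is not the same quantity as total tokens divided by total time. The sketch below compares the two on hypothetical step times; the function names are assumptions for illustration.

```python
from statistics import mean
from typing import List


def mean_of_rates(step_times: List[float]) -> float:
    """Average of per-token rates (1 / step time), as estimate_latency accumulates them."""
    return mean(1 / t for t in step_times)


def overall_throughput(step_times: List[float]) -> float:
    """Total number of tokens divided by total elapsed time."""
    return len(step_times) / sum(step_times)


# Hypothetical step times in seconds: two fast tokens and one slow one.
step_times = [0.1, 0.1, 0.4]
print(f"Mean of per-token rates: {mean_of_rates(step_times):.2f} tokens/s")
print(f"Tokens / total time:     {overall_throughput(step_times):.2f} tokens/s")
```

When step times vary, the mean of rates is the higher of the two, so the figure shown in the UI is best read as an indicative value rather than a precise benchmark.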

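Run instruction-following pipeline#

Now you are ready to explore the model's capabilities. This demo provides a simple interface for communicating with the model using text instructions. Type your instruction into the User instruction field or select one from the predefined examples, then click the Submit button to start generation. Additionally, you can modify the advanced generation parameters:

Device - Allows switching the inference device. Note that every time a new device is selected, the model is recompiled, which takes some time.
Max New Tokens - The maximum size of the generated text.
Top-p (nucleus sampling) - If set to less than 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher is kept for generation.
Top-k - The number of highest-probability vocabulary tokens to keep for top-k filtering.
Temperature - The value used to modulate the logits distribution.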

examples = [ 
    "Give me a recipe for pizza with pineapple", 
    "Write me a tweet about the new OpenVINO release", 
    "Explain the difference between CPU and GPU", 
    "Give five ideas for a great weekend with family", 
    "Do Androids dream of Electric sheep?", 
    "Who is Dolly?", 
    "Please give me advice on how to write resume?", 
    "Name 3 advantages to being a cat", 
    "Write instructions on how to become a good AI engineer", 
    "Write a love letter to my best friend", 
] 

with gr.Blocks() as demo: 
    gr.Markdown(
        "# Question Answering with " + model_id.value + " and OpenVINO.\n"
        "Provide instruction which describes a task below or select among predefined examples and model writes response that performs requested task."
    )

    with gr.Row(): 
        with gr.Column(scale=4): 
            user_text = gr.Textbox( 
                placeholder="Write an email about an alpaca that likes flan", 
                label="User instruction", 
            ) 
            model_output = gr.Textbox(label="Model response", interactive=False) 
            performance = gr.Textbox(label="Performance", lines=1, interactive=False) 
            with gr.Column(scale=1): 
                button_clear = gr.Button(value="Clear") 
                button_submit = gr.Button(value="Submit") 
            gr.Examples(examples, user_text) 
        with gr.Column(scale=1): 
            max_new_tokens = gr.Slider( 
                minimum=1, 
                maximum=1000, 
                value=256, 
                step=1, 
                interactive=True, 
                label="Max New Tokens", 
            ) 
            top_p = gr.Slider( 
                minimum=0.05, 
                maximum=1.0, 
                value=0.92, 
                step=0.05, 
                interactive=True, 
                label="Top-p (nucleus sampling)", 
            ) 
            top_k = gr.Slider( 
                minimum=0, 
                maximum=50, 
                value=0, 
                step=1, 
                interactive=True, 
                label="Top-k", 
            ) 
            temperature = gr.Slider( 
                minimum=0.1, 
                maximum=5.0, 
                value=0.8, 
                step=0.1, 
                interactive=True, 
                label="Temperature", 
            ) 
    user_text.submit( 
        run_generation, 
        [user_text, top_p, temperature, top_k, max_new_tokens, performance], 
        [model_output, performance], 
    ) 
    button_submit.click( 
        run_generation, 
        [user_text, top_p, temperature, top_k, max_new_tokens, performance], 
        [model_output, performance], 
    ) 
    button_clear.click( 
        reset_textbox, 
        [user_text, model_output, performance], 
        [user_text, model_output, performance], 
    ) 

if __name__ == "__main__": 
    demo.queue() 
    try: 
        demo.launch(height=800) 
    except Exception: 
        demo.launch(share=True, height=800) 

# If you are launching remotely, specify server_name and server_port
# Example: `demo.launch(server_name='your server name', server_port='server port in int')`
# For more details, see the Gradio documentation: https://gradio.app/docs/