[Deprecated] GPT-J Causal Language Modeling Demo

This demo is deprecated and is scheduled for removal in 2024.0. For up-to-date examples that use language models, see the new Python demos.

Introduction

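This demo shows how to use GPT-like models with OpenVINO™ Model Server (OVMS). The GPT-J 6B model used in this example is available on Hugging Face (~25 GB). The steps below automate downloading and converting the model so that it can be loaded with OpenVINO™. The sample Python client at the end of this document asks the model server for the next word of a sentence, repeatedly, until an EOS (end of sequence) token is received.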

Download the model

Prepare the environment:

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/gptj_causal_lm/python
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

Download the GPT-J-6b model from Hugging Face and save it to disk in PyTorch format with the following script.

Note: The model is about 25 GB, so the first download may take a while. Subsequent script runs use the cached model in the ~/.cache/huggingface directory.

python3 download_model.py

The script downloads the model with the transformers pip library, loads it into memory using the PyTorch backend, and saves it to disk in PyTorch format.

Note: Loading the model on a CPU device requires up to 48 GB of RAM. See the model specification for details.
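The two size figures above can be sanity-checked with a rough calculation. The sketch below assumes about 6 billion parameters (inferred from the "6B" in the model name) stored as 32-bit floats; both the parameter count and the precision are assumptions, not values read from the checkpoint.

```python
# Back-of-the-envelope size estimate for a "6B" model.
# Assumptions (not read from the checkpoint): ~6e9 parameters, 4 bytes each (FP32).
PARAMS = 6_000_000_000
BYTES_PER_PARAM = 4


def weights_size_gb(params: int = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the raw weight size in gigabytes (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 10**9


if __name__ == "__main__":
    size = weights_size_gb()
    print(f"Estimated weights: {size:.0f} GB")
    print(f"Two copies in memory: {2 * size:.0f} GB")
```

Under these assumptions the weights come to about 24 GB, in line with the ~25 GB download. The 48 GB RAM figure would be consistent with briefly holding two copies of the weights, but that interpretation is a guess rather than something the model specification is known to state.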

Convert the model

The model must be converted to IR format before OVMS can load it.

chmod +x convert_model.sh && ./convert_model.sh

The converted model is placed in the model/1 directory.
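Before starting the server, it can help to confirm that the conversion produced what OVMS expects: a numbered version directory holding an IR pair (an .xml file and a .bin file). The helper below is a hypothetical sketch, not part of the demo; it makes no assumption about the actual file names that convert_model.sh writes, only about the extensions.

```python
from pathlib import Path


def find_ir_versions(model_dir):
    """Map each numeric version directory to True if it holds both .xml and .bin files."""
    result = {}
    for child in sorted(Path(model_dir).iterdir()):
        if child.is_dir() and child.name.isdigit():
            has_xml = any(child.glob("*.xml"))
            has_bin = any(child.glob("*.bin"))
            result[int(child.name)] = has_xml and has_bin
    return result


if __name__ == "__main__":
    model_dir = Path("model")
    if model_dir.is_dir():
        print(find_ir_versions(model_dir))
    else:
        print("model directory not found; run convert_model.sh first")
```

A result such as `{1: True}` would indicate that version 1 contains a complete IR pair.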

Start OVMS with the prepared GPT-J-6b model

docker run -d --rm -p 9000:9000 -v $(pwd)/model:/model:ro openvino/model_server \
    --port 9000 \
    --model_name gpt-j-6b \
    --model_path /model \
    --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'
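The --plugin_config value is a JSON object passed as a single shell argument, which is easy to misquote by hand. As a minimal sketch, the same server arguments can be generated from a Python dictionary; `build_ovms_args` is a hypothetical helper introduced here for illustration and is not part of OVMS or the demo.

```python
import json
import shlex


def build_ovms_args(model_name, model_path, port, plugin_config):
    """Return the server arguments as a list, with plugin_config serialized to compact JSON."""
    return [
        "--port", str(port),
        "--model_name", model_name,
        "--model_path", model_path,
        "--plugin_config", json.dumps(plugin_config, separators=(",", ":")),
    ]


if __name__ == "__main__":
    args = build_ovms_args(
        "gpt-j-6b", "/model", 9000,
        {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": 1},
    )
    # shlex.join quotes the JSON argument so that it survives the shell intact.
    print(shlex.join(args))
```

Serializing with `json.dumps` guarantees valid JSON, and `shlex.join` (Python 3.8+) adds the quoting needed to paste the result after the image name in the docker command.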

                                

Interactive OVMS demo

Run the app.py script to start the interactive demo, which predicts the next word in a loop until an end-of-sequence token is produced.

python3 app.py --url localhost:9000 --model_name gpt-j-6b --input "Neurons are fascinating"

Output:

Neurons are fascinating cells that are able to communicate with each other and with other cells in the body. Neurons are the cells that make up the nervous system, which is responsible for the control of all body functions. Neurons are also responsible for the transmission of information from one part of the body to another.
Number of iterations: 62
First latency: 0.37613916397094727s
Last latency: 1.100903034210205s
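The loop behind this output can be sketched as follows. This is a simplified illustration and not the code of app.py: `next_token` is a hypothetical callable standing in for one request to the model server, and the toy model and its word-level tokens are made up so that the example runs without a server.

```python
import time


def generate(next_token, prompt_tokens, eos_token, max_iterations=100):
    """Greedy next-token loop: append one token per call until EOS is produced.

    `next_token` is a hypothetical stand-in for one inference request; it
    receives the token list so far and returns the next token.
    """
    tokens = list(prompt_tokens)
    latencies = []
    for _ in range(max_iterations):
        start = time.perf_counter()
        token = next_token(tokens)
        latencies.append(time.perf_counter() - start)
        if token == eos_token:
            break
        tokens.append(token)
    return tokens, latencies


if __name__ == "__main__":
    # Toy stand-in for the model: replays a fixed continuation, then emits EOS.
    continuation = ["cells", "that", "communicate", "<eos>"]
    prompt = ["Neurons", "are", "fascinating"]

    def toy_model(tokens):
        return continuation[len(tokens) - len(prompt)]

    tokens, latencies = generate(toy_model, prompt, "<eos>")
    print(" ".join(tokens))
    print(f"Number of iterations: {len(latencies)}")
    print(f"First latency: {latencies[0]}s")
    print(f"Last latency: {latencies[-1]}s")
```

Recording one latency per iteration makes it possible to compare the first and last requests, as the demo output does. Whether app.py counts the final EOS iteration the same way is not confirmed here.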

                                

Validating accuracy

Run the simple OVMS client script

The script prints the raw model output for an example input.

python3 infer_ovms.py --url localhost:9000 --model_name gpt-j-6b

Expected output:

[[[ 8.407803   7.2024884  5.114844  ... -6.691438  -6.7890754 -6.6537027]
  [ 6.97011    9.89741    8.216569  ... -3.891536  -3.6937592 -3.6568289]
  [ 8.199201  10.721757   8.502647  ... -6.340912  -6.247861  -6.1362333]
  [ 6.5459595 10.398776  11.310042  ... -5.9843545 -5.806437  -6.0776973]
  [ 8.934336  13.137416   8.568134  ... -6.835008  -6.7942514 -6.6916494]
  [ 5.1626735  6.062623   1.7213026 ... -7.789153  -7.568969  -7.6591196]]]
predicted word:  a
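The printed array appears to hold one row of scores per input token, with one column per vocabulary entry. Under that assumption, the predicted word would be the vocabulary entry with the highest score in the last row. The sketch below illustrates this selection on a tiny made-up array; it is not taken from infer_ovms.py, and mapping the resulting index back to a word requires the tokenizer, which is omitted here.

```python
def predict_token_id(logits):
    """Return the index of the largest score at the last sequence position.

    `logits` is a nested list shaped [batch][sequence][vocabulary]; only the
    first batch entry is used.
    """
    last_position = logits[0][-1]
    return max(range(len(last_position)), key=last_position.__getitem__)


if __name__ == "__main__":
    # Made-up example: 1 batch entry, 2 positions, vocabulary of 4 tokens.
    toy_logits = [[[0.1, 2.0, -1.0, 0.3],
                   [1.5, -0.2, 3.7, 0.0]]]
    print(predict_token_id(toy_logits))  # index of 3.7 in the last row
```

Only the last row matters for next-word prediction because it is the row computed after the final input token.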

                                    

Run inference with PyTorch

Run inference with PyTorch and compare the results with the OVMS output.

python3 infer_torch.py

Output:

tensor([[[ 8.4078,  7.2025,  5.1148,  ..., -6.6914, -6.7891, -6.6537],
         [ 6.9701,  9.8974,  8.2166,  ..., -3.8915, -3.6938, -3.6568],
         [ 8.1992, 10.7218,  8.5026,  ..., -6.3409, -6.2479, -6.1362],
         [ 6.5460, 10.3988, 11.3100,  ..., -5.9844, -5.8064, -6.0777],
         [ 8.9343, 13.1374,  8.5681,  ..., -6.8350, -6.7943, -6.6916],
         [ 5.1627,  6.0626,  1.7213,  ..., -7.7891, -7.5690, -7.6591]]],
       grad_fn=<ViewBackward0>)
predicted word:  a
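Comparing the two outputs by eye is error-prone, so a small numeric check can help. The sketch below uses the six visible values from the first printed row of each output; the 1e-4 tolerance is an assumption chosen because PyTorch prints only four decimal places here.

```python
# Visible values copied from the first printed row of each output.
OVMS_ROW = [8.407803, 7.2024884, 5.114844, -6.691438, -6.7890754, -6.6537027]
TORCH_ROW = [8.4078, 7.2025, 5.1148, -6.6914, -6.7891, -6.6537]


def max_abs_diff(a, b):
    """Return the largest absolute element-wise difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    diff = max_abs_diff(OVMS_ROW, TORCH_ROW)
    # The PyTorch values are rounded to 4 decimals, so agreement within 1e-4
    # is the most this check can demonstrate.
    print(f"max abs diff: {diff:.1e}")
    print("match" if diff < 1e-4 else "mismatch")
```

A stricter comparison would need the full, unrounded tensors from both scripts rather than the printed excerpts.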

                                    

Pipeline mode with server-side tokenization and detokenization

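This variant offloads the tokenization and detokenization steps from the client to the server. OVMS can convert a string proto into a 2D U8 tensor and pass the data to the tokenization custom node. In this way, tokens for the gpt-j-6b model are generated automatically, and the response is returned as text rather than as a probability vector.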
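To make the "2D U8 tensor" idea concrete, the sketch below packs a batch of strings into equal-length rows of byte values. The layout used here (UTF-8 bytes followed by zero padding, with at least one terminating zero per row) is an assumption for illustration; the exact layout OVMS produces is not described in this document.

```python
def pack_strings_u8(strings):
    """Pack strings into equal-length rows of byte values (0-255).

    Assumed layout: UTF-8 bytes of each string, zero-padded to the length of
    the longest string plus one, so every row ends with at least one 0.
    """
    encoded = [s.encode("utf-8") for s in strings]
    width = max((len(e) for e in encoded), default=0) + 1
    return [list(e) + [0] * (width - len(e)) for e in encoded]


if __name__ == "__main__":
    rows = pack_strings_u8(["Hi", "Neurons"])
    print(len(rows), len(rows[0]))  # batch size and row width
    for row in rows:
        print(row)
```

A fixed-width layout like this lets a variable-length batch of text travel as one rectangular tensor, which the tokenization node can then split back into strings at the zero bytes.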


Prepare the environment

Use the make command to prepare the custom node libraries, the BlingFire tokenization models, and the configuration file.

The custom nodes used in this demo are included in the OpenVINO Model Server image, so you can either use the custom nodes from the image or build them yourself.

To run this demo with the precompiled custom nodes, run:

make

The workspace should look like this:

tree workspace
workspace
├── config.json
└── tokenizers
    ├── gpt2.bin
    └── gpt2.i2w

1 directory, 3 files

(Optional) If you have modified the custom nodes, or want to compile them and attach them to the container for any other reason, run:

make BUILD_CUSTOM_NODE=true BASE_OS=ubuntu

The workspace should look like this:

workspace
├── config.json
├── lib
│   ├── libdetokenizer.so
│   └── libtokenizer.so
└── tokenizers
    ├── gpt2.bin
    └── gpt2.i2w

Start OVMS with the prepared workspace:

docker run -d --rm -p 9000:9000 \
    -v $(pwd)/model:/onnx:ro \
    -v $(pwd)/workspace:/workspace:ro \
    openvino/model_server \
    --port 9000 \
    --config_path /workspace/config.json

Install the TensorFlow Serving API:

pip install --upgrade pip
pip install tensorflow-serving-api==2.11.0

Run the sample client:

python3 dag_client.py --url localhost:9000 --model_name my_gpt_pipeline --input "Neurons are fascinating"
b'Neurons are fascinating cells that are responsible for the transmission of information from one brain region to another. They are also responsible for the production of hormones and neurotransmitters that are responsible for the regulation of mood, sleep, appetite, and sexual function.\n'
