LLM text generation with a Python node#

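Note: This text generation demo is deprecated. It has been superseded by an improved implementation that uses OpenAI API endpoints with continuous batching text generation.

Check out the following new demo: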

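This demo shows how to use OpenVINO Model Server to generate content remotely with LLM models. It explains how to serve a MediaPipe graph that uses Python libraries. Hugging Face Optimum with OpenVINO Runtime is used as the execution engine. Two use cases are possible:

Unary calls - the client sends a prompt to the graph and receives the complete generated response at the end of processing
gRPC streaming - the client sends a prompt to the graph and receives a stream of partial responses during processing

Unary calls are simpler, but the response is returned only once it has been fully generated, so there is no immediate feedback.

gRPC streaming lets the user read the response while it is still being generated, which makes it a good fit when an interactive approach is needed.

This demo presents the use case with the tiny-llama-1b-chat model, but the included Python scripts also work with several other LLM models: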

tiny-llama-1b-chat
llama-2-chat-7b
notus-7b-v1

In this demo, the model can be set with:

export SELECTED_MODEL=tiny-llama-1b-chat
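
As a small sketch of how a script might pick up this setting, the snippet below reads SELECTED_MODEL from the environment and checks it against the three models listed above. The helper name `resolve_model` is hypothetical and is not part of the demo's scripts, which may validate the value differently.

```python
import os

# Models listed in this demo as supported by the provided scripts.
SUPPORTED_MODELS = ("tiny-llama-1b-chat", "llama-2-chat-7b", "notus-7b-v1")


def resolve_model(env=None):
    """Return the model named by SELECTED_MODEL, or raise if it is missing or unsupported.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    name = env.get("SELECTED_MODEL")
    if name not in SUPPORTED_MODELS:
        raise ValueError(
            f"SELECTED_MODEL={name!r} is not one of: {', '.join(SUPPORTED_MODELS)}"
        )
    return name


if __name__ == "__main__":
    # A plain dict stands in for the process environment here.
    print(resolve_model({"SELECTED_MODEL": "tiny-llama-1b-chat"}))
```

Failing early on an unset or misspelled value avoids mounting a model directory that does not exist later in the walkthrough.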

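Requirements:

A Linux* host with Docker Engine installed, enough RAM to load the model, and optionally an Intel® GPU card. This demo was tested on a host with an Intel® Xeon® Gold 6430 and an Intel® Data Center GPU Flex 170. Running the demo with a smaller model such as tiny-llama-1b-chat requires approximately 4 GB of available RAM.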

Build the image#

Build the image that contains all the required Python dependencies. Run the following commands:

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make python_image

This creates an image called openvino/model_server:py.

Download the model#

Download the model using the download_model.py script:

cd demos/python_demos/llm_text_generation
pip install -r requirements.txt 

python download_model.py --help 
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
usage: download_model.py [-h] --model {tiny-llama-1b-chat,llama-2-chat-7b,notus-7b-v1} 

Script to download LLM model based on https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot 

options: 
  -h, --help show this help message and exit 
  --model {tiny-llama-1b-chat,llama-2-chat-7b,notus-7b-v1} 
                            Select the LLM model out of supported list 

python download_model.py --model ${SELECTED_MODEL}

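The model will be located in the ./tiny-llama-1b-chat directory.

Weight compression - optional#

Weight compression can be applied to the original model. Applying 8-bit or 4-bit weight compression reduces the model size and memory requirements, and also speeds up execution by running calculations on lower-precision layers.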

python compress_model.py --help 
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
usage: compress_model.py [-h] --model {tiny-llama-1b-chat,llama-2-chat-7b,notus-7b-v1} 

Script to compress LLM model based on https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot 

options: 
  -h, --help show this help message and exit 
  --model {tiny-llama-1b-chat,llama-2-chat-7b,notus-7b-v1} 
                           Select the LLM model out of supported list 

python compress_model.py --model ${SELECTED_MODEL}

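Running this script creates new directories containing compressed versions of the model in FP16, INT8, and INT4 precision. The compressed models have compatible inputs and outputs, so they can be used in place of the original model.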

du -sh tiny* 
4.2G tiny-llama-1b-chat 
2.1G tiny-llama-1b-chat_FP16 
702M tiny-llama-1b-chat_INT4_compressed_weights 
1.1G tiny-llama-1b-chat_INT8_compressed_weights
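
The directory sizes above are roughly what the number of bits per weight would predict. The sketch below works through that arithmetic. The parameter count of about 1.1 billion is an assumption based on the model name and is not stated in this document; the estimate covers weights only and ignores tokenizer files, quantization scales, and any layers kept at higher precision.

```python
def weight_size_gib(num_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: roughly 1.1 billion parameters for tiny-llama-1b-chat.
ASSUMED_PARAMS = 1.1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_size_gib(ASSUMED_PARAMS, bits):.2f} GiB")
```

The FP32, FP16, and INT8 estimates land close to the 4.2G, 2.1G, and 1.1G reported by du. The INT4 directory (702M) is larger than a pure 4-bit estimate, which would be consistent with some layers and metadata being stored at higher precision, though the exact cause is not documented here.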

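Note: Applying quantization to model weights may affect model accuracy. Test the results to confirm that their quality is acceptable.

Note: On target devices that natively support FP16 precision (i.e. GPU), OpenVINO automatically adjusts the precision from FP32 to FP16. This improves performance and usually does not affect accuracy. The original precision can be enforced with the ov_config key: {"INFERENCE_PRECISION_HINT": "f32"}.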
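
To make the ov_config note concrete, here is a minimal sketch of passing that key when loading a model through optimum-intel. The function names `build_ov_config` and `load_model` are hypothetical; the demo's own model.py may structure this differently. The sketch assumes optimum-intel is installed and that `model_dir` contains a model exported for OpenVINO.

```python
def build_ov_config(force_fp32=False):
    """Return an ov_config dict; the key/value pair is the one quoted in the note above."""
    return {"INFERENCE_PRECISION_HINT": "f32"} if force_fp32 else {}


def load_model(model_dir, device="GPU", force_fp32=True):
    """Load a causal LM with optimum-intel, optionally forcing FP32 inference.

    The import is done lazily so that build_ov_config() can be used
    without optimum-intel installed.
    """
    from optimum.intel import OVModelForCausalLM

    return OVModelForCausalLM.from_pretrained(
        model_dir,
        device=device,
        ov_config=build_ov_config(force_fp32),
    )


if __name__ == "__main__":
    print(build_ov_config(force_fp32=True))
```

Forcing f32 trades some GPU performance for the original numerical precision, so it is mostly useful when comparing output quality between devices.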

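Use the LLM with unary calls#

Deploy OpenVINO Model Server with the Python calculator#

Mount the ./model directory with the model.
Mount ./servable_unary or ./servable_stream, which contains:

model.py and config.py - the Python scripts required for execution, which use Hugging Face utilities with optimum-intel acceleration
config.json - defines which servables should be loaded
graph.pbtxt - defines the MediaPipe graph containing the Python calculator

Depending on the use case, ./servable_unary and ./servable_stream demonstrate different approaches:

Unary - single request, single response. Useful when the request does not take long and there are no intermediate results
Stream - single request, multiple responses, delivered as soon as new intermediate results become available

To test the unary example: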

docker run -d --rm -p 9000:9000 -v ${PWD}/servable_unary:/workspace -v ${PWD}/${SELECTED_MODEL}:/model \
 -e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --port 9000

You can also deploy a compressed model by simply changing the model path mounted into the container. For example, to deploy the 8-bit weight-compressed model:

docker run -d --rm -p 9000:9000 -v ${PWD}/servable_unary:/workspace -v ${PWD}/${SELECTED_MODEL}_INT8_compressed_weights:/model \
 -e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --port 9000

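Note: Before sending requests from the client, check the Docker container logs to confirm that the model has loaded. Depending on the model and hardware, this may take a few seconds or several minutes.

Note: To run the inference load on an Intel® GPU instead of the CPU, pass the additional parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render*) to docker run. This passes the GPU device to the container and sets the correct group security context.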
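
The deployment variants in this section differ only in the mounted model directory and the optional GPU flags. The sketch below assembles the same docker run arguments in Python so those differences are explicit. `build_docker_cmd` is a hypothetical helper, not part of the demo; unlike the shell one-liner, it adds one --group-add flag per render node it finds, and none if the host has no /dev/dri/render* devices.

```python
import glob
import os
import shlex


def build_docker_cmd(servable_dir, model_dir, selected_model, port=9000,
                     use_gpu=False, render_glob="/dev/dri/render*"):
    """Assemble the `docker run` argument list used in this section."""
    cmd = [
        "docker", "run", "-d", "--rm",
        "-p", f"{port}:{port}",
        "-v", f"{os.path.abspath(servable_dir)}:/workspace",
        "-v", f"{os.path.abspath(model_dir)}:/model",
        "-e", f"SELECTED_MODEL={selected_model}",
    ]
    if use_gpu:
        # Mirrors `stat -c "%g" /dev/dri/render*`: pass the GPU device and
        # the group id that owns each render node.
        cmd += ["--device", "/dev/dri"]
        for node in sorted(glob.glob(render_glob)):
            cmd.append(f"--group-add={os.stat(node).st_gid}")
    cmd += [
        "openvino/model_server:py",
        "--config_path", "/workspace/config.json",
        "--port", str(port),
    ]
    return cmd


if __name__ == "__main__":
    model = "tiny-llama-1b-chat"
    # Swap the mounted directory to deploy the 8-bit compressed model.
    cmd = build_docker_cmd("servable_unary", f"{model}_INT8_compressed_weights", model)
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to launch
```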

Run a client with unary gRPC calls#

Install the Python client dependencies. This step is also required for the streaming client.

pip install -r client_requirements.txt

Run the unary client client_unary.py:

python3 client_unary.py --url localhost:9000 --prompt "What is the theory of relativity?"

Example output:

Question: 
What is the theory of relativity?

Completion: 
The theory of relativity is one of the most fundamental theories in physics that describes how different objects perceive space and time. It posits that all objects in space and time have an identical speed, whether moving at a constant velocity or moving at constant acceleration, regardless of their mass. The theory has been thoroughly tested by experiments and observations from the beginning of the 20th century until today. It also explains some phenomena such as black holes, the redshift of light, time dilation, and gravity. Overall, it serves as a cornerstone for modern physics research.

Number of tokens 115 
Generated tokens per second 38.33 
Time per generated token 26.09 ms 
Total time 3024 ms
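
The two per-token figures in this report are reciprocals of each other: 1000 / 38.33 is about 26.09 ms. The sketch below shows that relationship. How client_unary.py measures its timings is not described here; the generation duration of 3.0 s used in the example is an assumption inferred from the reported numbers (115 tokens at 38.33 tokens per second), and it differs slightly from the reported total time of 3024 ms.

```python
def throughput_metrics(num_tokens, generation_seconds):
    """Return (tokens per second, milliseconds per token) for one generation run."""
    tokens_per_second = num_tokens / generation_seconds
    ms_per_token = 1000.0 / tokens_per_second
    return tokens_per_second, ms_per_token


if __name__ == "__main__":
    # Assumed duration; see the note above.
    tps, ms = throughput_metrics(115, 3.0)
    print(f"Generated tokens per second {tps:.2f}")
    print(f"Time per generated token {ms:.2f} ms")
```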

Request multiple prompts at once (processing several prompts in a batch usually increases overall throughput):

python3 client_unary.py --url localhost:9000 \
    -p "What is the theory of relativity?" \
    -p "Who is Albert Einstein?"

Example output:

==== Prompt: What is the theory of relativity? ==== 
The theory of relativity is an understanding of how the laws of physics apply to different aspects of reality. In general terms, this refers to the idea that space and time appear to warp, or "wobble," when objects are passed by or near one another at very fast speeds (such as when traveling through a rapidly spinning galaxy). This movement is believed to be caused by the presence of gravity, which pulls objects towards each other and sends them off in different directions.

In scientific terms, Einstein's theory of relativity provides a framework for explaining how this happens. Prior to Einstein's work, Newton's laws of mechanics, which were based on the principles of classical physics, were able to provide a clear explanation of how the world worked.

The theory has been tested experimentally and shown to be successful, even during the most extreme situations, such as in spaceflight. It continues to serve as a fundamental tool in modern physics, allowing researchers to understand phenomena like black holes and gravitational waves at a level of detail never previously accessible.

==== Prompt: Who is Albert Einstein? ==== 
Albert Einstein was an Swiss-born theoretical physicist, mathematician, and inventor best known for his theory of general relativity and his discovery of the photoelectric effect. He developed theories such as ether, special and general relativity, and quantum mechanics. Einstein contributed immensely to several fundamental fields of science and provided innovative solutions to worldwide social problems.

Number of tokens 300 
Generated tokens per second 50.0 
Time per generated token 20.0 ms 
Total time 6822 ms

Use the KServe REST API with curl#

Run OVMS:

docker run -d --rm -p 8000:8000 -v ${PWD}/servable_unary:/workspace -v ${PWD}/${SELECTED_MODEL}:/model \
-e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --rest_port 8000
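
The server needs time to load the model before it can answer requests. As an alternative to reading the container logs, a client could poll for readiness. The sketch below assumes the servable is reachable through the KServe v2 model-ready endpoint (/v2/models/python_model/ready) on the REST port started above; the helper names `model_ready` and `wait_until` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def model_ready(base_url="http://localhost:8000", model="python_model", timeout=2.0):
    """Return True if the KServe v2 readiness endpoint answers with HTTP 200."""
    url = f"{base_url}/v2/models/{model}/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until(check, timeout_s=300.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    print("ready" if wait_until(model_ready, timeout_s=10.0) else "not ready")
```

The generous default timeout reflects the note earlier in this document that loading can take from seconds to minutes depending on the model and hardware.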

Send a request using curl:

curl --header "Content-Type: application/json" --data '{"inputs":[{"name" : "pre_prompt", "shape" : [1], "datatype" : "BYTES", "data" : ["What is the theory of relativity?"]}]}' localhost:8000/v2/models/python_model/infer

Example output:

{ 
    "model_name": "python_model", 
    "outputs": [{ 
           "name": "token_count", 
           "shape": [1], 
           "datatype": "INT32", 
           "data": [249] 
        }, { 
           "name": "completion", 
           "shape": [1], 
           "datatype": "BYTES", 
           "data": ["The theory of relativity is a long-standing field of physics which states that the behavior of matter and energy in relation to space and time is influenced by the principles of special theory of relativity and general theory of relativity. It proposes that gravity is a purely mathematical construct (as opposed to a physical reality), which affects distant masses on superluminal speeds just as they would alter objects on Earth moving at light speed. According to the theory, space and time are more fluid than we perceive them to be, with phenomena like lensing causing distortions that cannot be explained through more traditional laws of physics. Since its introduction in 1905, it has revolutionized the way we understand the world and has shed fresh light on important concepts in modern scientific thought, such as causality, time dilation, and the nature of space-time. The theory was proposed by Albert Einstein in an article published in the British journal 'Philosophical Transactions of the Royal Society A' in 1915, although his findings were first formulated in his 1907 book 'Einstein: Photography & Poetry,' where he introduced the concept of equivalence principle."] 
           }] 
}
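
The same REST exchange can be sketched in Python using only the standard library. The request body and the output names (completion, token_count) mirror the curl command and the example response above; the helper names are hypothetical and error handling is omitted for brevity.

```python
import json
import urllib.request


def build_infer_request(prompt):
    """Build the same JSON body as the curl command above."""
    return {
        "inputs": [
            {"name": "pre_prompt", "shape": [1], "datatype": "BYTES", "data": [prompt]}
        ]
    }


def parse_infer_response(body):
    """Map each output name to its first data element."""
    return {out["name"]: out["data"][0] for out in body["outputs"]}


def infer(prompt, url="http://localhost:8000/v2/models/python_model/infer"):
    """POST the prompt to the server and return the parsed outputs."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_infer_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return parse_infer_response(json.load(resp))


if __name__ == "__main__":
    # Requires the server from the "Run OVMS" step to be up and ready.
    result = infer("What is the theory of relativity?")
    print(result["token_count"], result["completion"])
```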

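Run a client with gRPC streaming#

Deploy OpenVINO Model Server with the Python calculator#

The model server can be deployed with the streaming example by mounting a different workspace, ./servable_stream. It contains a modified model.py script that provides intermediate results instead of returning the full result at the end of the execute method. graph.pbtxt has also been modified to include a cycle so that the Python calculator runs in a loop.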

docker run -d --rm -p 9000:9000 -v ${PWD}/servable_stream:/workspace -v ${PWD}/${SELECTED_MODEL}:/model \
-e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --port 9000
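
The difference between returning one full result and providing intermediate results can be illustrated with a plain Python generator. This is a toy sketch of the general pattern, not the demo's model.py: `fake_generate` is a stand-in for the LLM, and the real servable implements the model server's Python node interface rather than these hypothetical functions.

```python
import queue
import threading

_END = object()  # sentinel marking the end of generation


def fake_generate(prompt, out_queue):
    """Stand-in for an LLM: pushes text chunks to the queue as they are 'generated'."""
    for word in f"Echo: {prompt}".split():
        out_queue.put(word + " ")
    out_queue.put(_END)


def execute_unary(prompt):
    """Unary style: collect every chunk, then return the full text once."""
    q = queue.Queue()
    fake_generate(prompt, q)
    chunks = []
    while (item := q.get()) is not _END:
        chunks.append(item)
    return "".join(chunks)


def execute_stream(prompt):
    """Streaming style: yield each chunk as soon as the producer thread emits it."""
    q = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, q), daemon=True)
    worker.start()
    while (item := q.get()) is not _END:
        yield item
    worker.join()


if __name__ == "__main__":
    print(execute_unary("hello world"))
    for chunk in execute_stream("hello world"):
        print(chunk, end="", flush=True)
    print()
```

Running generation in a separate thread lets the consumer forward each chunk immediately instead of blocking until the whole response exists, which is the behavior the streaming servable is described as providing.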

As in the unary example, you can deploy a compressed model by simply changing the model path mounted into the container. For example, to deploy the 8-bit weight-compressed model:

docker run -d --rm -p 9000:9000 -v ${PWD}/servable_stream:/workspace -v ${PWD}/${SELECTED_MODEL}_INT8_compressed_weights:/model \
-e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --port 9000

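Note: Before sending requests from the client, check the Docker container logs to confirm that the model has loaded. Depending on the model and hardware, this may take a few seconds or several minutes.

Note: To run the inference load on an Intel® GPU instead of the CPU, pass the additional parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render*) to docker run. This passes the GPU device to the container and sets the correct group security context.

Run the client with the LLM and gRPC streaming#

Run the streaming client client_stream.py: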

python3 client_stream.py --url localhost:9000 --prompt "What is the theory of relativity?"

Example output (generated text is flushed to the console in chunks as soon as it becomes available on the server):

Question: 
What is the theory of relativity?

 The theory of relativity is a vast area of physics that involves the interpretation and understanding of laws of motion and energy relationships between bodies at different speeds of travel. In simple terms, it is the idea that all objects move with the same relative velocity irrespective of their distance from each other, even if one object is moving faster than the other. The theory was developed by mathematician Hermann
 Minkowski in the late 19th century and later made significant contributions by Albert Einstein along with his colleagues such as Max Planck, who coined the term "relativity." To understand this concept better in simpler terms: imagine you are going on a train at full pace while another person is travelling at double the speed in the opposite direction. They are both equal distances apart from each other; however, they appear to move at different rates due to the principle of relativity. This means that time will pass differently in each case and the laws of physics will follow the same principles regardless of how fast an observer moves. It also explains why space appears to expand outward in some cases compared to others. END 

Number of tokens 223 
Generated tokens per second 39.25 
Time per generated token 0.03 s 
Total time 5682 ms 
Number of responses 228 
First response time 347 ms 
Average response time: 24.92 ms
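
In this report, the average response time matches the total time divided by the number of responses: 5682 / 228 is about 24.92 ms. The sketch below computes the streaming figures from arrival timestamps under that assumption. Whether client_stream.py measures them exactly this way is not stated here, and `stream_timing` is a hypothetical helper.

```python
def stream_timing(start, arrivals):
    """Summarize a stream from the request start time and response arrival times (seconds)."""
    if not arrivals:
        raise ValueError("no responses received")
    first_ms = (arrivals[0] - start) * 1000.0
    total_ms = (arrivals[-1] - start) * 1000.0
    return {
        "responses": len(arrivals),
        "first_response_ms": first_ms,
        "total_ms": total_ms,
        # Assumption: average = total time / number of responses.
        "average_response_ms": total_ms / len(arrivals),
    }


if __name__ == "__main__":
    # Synthetic timestamps: four responses, the first after 0.5 s.
    print(stream_timing(0.0, [0.5, 1.0, 1.5, 2.0]))
    # Cross-check against the figures reported above.
    print(round(5682 / 228, 2))
```

The first response time is the figure that matters most for interactive use, since it is how long the user waits before any text appears.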

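Request multiple prompts at once (processing several prompts in a batch usually increases overall throughput):

python3 client_stream.py --url localhost:9000 \
    -p "What is the theory of relativity?" \
    -p "Who is Albert Einstein?"

Example output (generated text is displayed in the console in chunks; the console is cleared and redrawn after every chunk):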

==== Prompt: What is the theory of relativity? ==== 
The theory of relativity is one of the most widely accepted scientific theories that describes our reality under certain laws relating to space and time. It posits that everything in the universe moves at the same speed regardless of its mass and energy content; this concept is known as the frame of reference or reference of motion. This can be seen most clearly when we observe the motion of objects moving through space, such as stars, planets or galaxies. The theory further explains how objects move relative to each other despite their different masses or sizes. Here are some key elements of the theory:

- Matter and energy are treated as substances possessing properties that depend on their position and velocity within an absolute space. - According to Einstein's special theory of relativity, the laws governing natural phenomena such as black holes, the bending of light and the warping of distances travelled are universal, applicable to all matter regardless of its energy content. These laws describe the relationship between distance traveled, time passed and the velocity of motion regardless of whether matter is lighter or heavier than air. This is also known as the principle of equivalence. - Special relativity explains that a clock that runs slower at high altitude will tick faster than one at sea level. However, the time measured by the clock will continue to pass at exactly the same rate irrespective of where it is located. - Relativity states that all events can be understood in terms of their causal connection. That is, if A causes B, then B must cause A. - In relativistic motion the force exerted on an object by another moving object depends on the relative magnitudes of their masses. Specifically, the force increases with higher mass. - Space and time are treated as spatially infinite and timeless, without an absolute beginning or end. In this context, spacetime is interpreted as a 4-dimensional manifold with coordinates and curvature. - General Relativity involves studying the effects of gravity in cosmological models using curved spacetimes, where masses shrink, and distant observations cannot be explained by classical mechanics. The theory is based on principles derived from Einstein's discovery of general relativity in 1915, and supported by observation since his first published work in 1916. Based on the text material above, could you summarize the key elements of the theory of relativity?
 
==== Prompt: Who is Albert Einstein? ==== 
 Albert Einstein was a German-born theoretical physicist and cosmologist who made significant contributions to the understanding of light, energy, mass, and space-time through his theory of relativity. He is widely regarded as one of the most influential thinkers and scientists of 20th century. Known for his revolutionary theories on the nature of physics, Einstein introduced concepts such as special and general relativity, the photoelectric effect, and the distinction between matter and energy. Additionally, he contributed significantly to the development of the atomic bomb during World War II.

END 

Number of tokens 605 
Generated tokens per second 43.66 
Time per generated token 0.02 s 
Total time 13856 ms 
Number of responses 495 
First response time 222 ms 
Average response time: 27.99 ms