Efficient LLM Serving

Let's deploy the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model and request a generation.

Install the Python dependencies for the conversion scripts:

pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/releases/2024/3/demos/continuous_batching/requirements.txt

To download and quantize the model, run optimum-cli:

mkdir workspace && cd workspace 

optimum-cli export openvino --disable-convert-tokenizer --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 TinyLlama-1.1B-Chat-v1.0

convert_tokenizer -o TinyLlama-1.1B-Chat-v1.0 --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens TinyLlama/TinyLlama-1.1B-Chat-v1.0

Create a graph.pbtxt file in the model directory:

echo ' 
input_stream: "HTTP_REQUEST_PAYLOAD:input" 
output_stream: "HTTP_RESPONSE_PAYLOAD:output" 

node: { 
  name: "LLMExecutor" 
  calculator: "HttpLLMCalculator" 
  input_stream: "LOOPBACK:loopback" 
  input_stream: "HTTP_REQUEST_PAYLOAD:input" 
  input_side_packet: "LLM_NODE_RESOURCES:llm" 
  output_stream: "LOOPBACK:loopback" 
  output_stream: "HTTP_RESPONSE_PAYLOAD:output" 
  input_stream_info: { 
    tag_index: "LOOPBACK:0", 
    back_edge: true 
  } 
  node_options: { 
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: { 
          models_path: "./"
      }
  } 
  input_stream_handler { 
    input_stream_handler: "SyncSetInputStreamHandler", 
    options { 
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] { 
        sync_set { 
          tag_index: "LOOPBACK:0" 
        } 
      } 
    } 
  } 
} 
' >> TinyLlama-1.1B-Chat-v1.0/graph.pbtxt

Create the server's config.json file:

echo ' 
{ 
    "model_config_list": [], 
    "mediapipe_config_list": [ 
        { 
            "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", 
            "base_path": "TinyLlama-1.1B-Chat-v1.0" 
        } 
    ] 
} 
' >> config.json
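
Before starting the server, it can help to confirm that config.json parses as JSON and lists at least one graph. The following is a minimal sketch, assuming jq is installed (it is also used later in this guide). check_config is a hypothetical helper name and is not part of the model server.

```shell
# Hypothetical helper: succeeds only if the file is valid JSON and
# mediapipe_config_list has at least one entry with name and base_path.
check_config() {
  jq -e '.mediapipe_config_list | length > 0 and all(.[]; has("name") and has("base_path"))' "$1" > /dev/null
}
```

For example, `check_config config.json && echo "config looks valid"` prints the message only when the check passes.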

Deploy:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json

Wait for the model to load. Check its status with the following command:

curl http://localhost:8000/v1/config

{
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
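
Because loading can take a while, the status check can be repeated in a loop. The following is a minimal sketch, not an official tool: the function names, the 60-attempt limit, and the 5-second interval are illustrative assumptions, and the match simply looks for a "state" field equal to "AVAILABLE" in the response.

```shell
# Hypothetical helper: reads a status response on stdin and succeeds
# if it contains "state": "AVAILABLE" (whitespace around ':' is ignored).
is_available() {
  grep -Eq '"state"[[:space:]]*:[[:space:]]*"AVAILABLE"'
}

# Hypothetical helper: polls the config endpoint until the model is ready.
# $1 = URL (assumed default below), $2 = maximum number of attempts.
wait_for_model() {
  url="${1:-http://localhost:8000/v1/config}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -s "$url" | is_available; then
      echo "model is available"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "timed out waiting for the model" >&2
  return 1
}
```

Calling `wait_for_model` with no arguments polls the endpoint used above and returns a non-zero status if the model never reports AVAILABLE.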

Run generation:

curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "max_tokens": 30,
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }' | jq .

{ 
  "choices": [ 
    { 
      "finish_reason": "stop", 
      "index": 0, 
      "logprobs": null, 
      "message": { 
        "content": "OpenVINO is a software toolkit developed by Intel that enables developers to accelerate the training and deployment of deep learning models on Intel hardware.", 
        "role": "assistant" 
      } 
    } 
  ], 
  "created": 1718607923, 
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", 
  "object": "chat.completion" 
}
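
When only the generated text is needed, the response can be filtered further. The following is a small sketch based on the response shape shown above. extract_reply is a hypothetical helper name, and it assumes jq is installed.

```shell
# Hypothetical helper: reads a chat completion response on stdin and
# prints the text of the first choice (-r prints it without JSON quotes).
extract_reply() {
  jq -r '.choices[0].message.content'
}
```

Piping the curl command above into `extract_reply` instead of `jq .` prints just the assistant's message.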

Note: To receive the response as a stream of chunks while it is being generated, change the stream parameter in the request to true.
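
A streaming request might look like the following sketch. The helper names are hypothetical, the request reuses the endpoint and model name from above, and jq is assumed to be installed. The exact format of the returned chunks is not shown in this guide, so the sketch prints them as received.

```shell
# Hypothetical helper: builds a chat request body with streaming enabled.
# jq --arg escapes the prompt, so quotes in it cannot break the JSON.
build_stream_payload() {
  jq -cn --arg prompt "$1" '{
    model: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_tokens: 30,
    stream: true,
    messages: [{role: "user", content: $prompt}]
  }'
}

# Hypothetical helper: sends the request. -N turns off curl's output
# buffering so each chunk is printed as soon as it arrives, and
# -d @- reads the request body from stdin.
stream_chat() {
  build_stream_payload "$1" |
    curl -sN http://localhost:8000/v3/chat/completions \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, `stream_chat "What is OpenVINO?"` sends the same question as above with streaming turned on.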

Related information