Create a RAG system using OpenVINO and LangChain#

This Jupyter notebook can be launched only after a local installation.

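Retrieval-augmented generation (RAG) is a technique for augmenting LLM knowledge with additional, often private or real-time, data. LLMs can reason about wide-ranging topics, but their knowledge is limited to public data up to the specific point in time at which they were trained. To build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the model's knowledge with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as retrieval-augmented generation (RAG).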
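To make "retrieve the appropriate information and insert it into the model prompt" concrete, here is a minimal sketch in plain Python. It is not the pipeline built in this notebook: the word-overlap scoring, the helper names (`tokenize`, `retrieve`, `build_prompt`), and the sample documents are all assumptions made for illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the question (toy scoring)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved passages into the prompt ahead of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"

# Hypothetical private data the LLM could not have seen during training
documents = [
    "The cafeteria opens at 8 am on weekdays.",
    "The private release notes say version 2.1 ships in March.",
]
question = "When does version 2.1 ship?"
context = retrieve(question, documents, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

A real system replaces the word-overlap score with embedding similarity and sends the assembled prompt to an LLM, but the control flow stays the same.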

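LangChain is a framework for developing applications powered by language models. It has a number of components specifically designed to help build RAG applications. In this tutorial, we build a simple question-answering application over a text data source.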

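This tutorial consists of the following steps:

Install prerequisites
Download and convert the model from a public source using the OpenVINO integration with Hugging Face Optimum
Compress model weights to 4-bit or 8-bit data types using NNCF
Create a RAG chain pipeline
Run the Q&A pipeline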

In this example, the customized RAG pipeline chains an embedding model, a reranker, and an LLM in that order, and all three are deployed with OpenVINO to optimize inference performance.

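Table of contents:

Prerequisites
Select model for inference
Log in to huggingfacehub to get access to pretrained models
Convert model and compress model weights
Select device for inference and model variant
Load models
Run QA over Document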

Prerequisites#

Install the required dependencies.

import os 

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false" 

%pip install -Uq pip 
%pip uninstall -q -y optimum optimum-intel 
%pip install --pre -Uq openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly 
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu\ 
"git+https://github.com/huggingface/optimum-intel.git"\ 
"git+https://github.com/openvinotoolkit/nncf.git"\ 
"datasets"\ 
"accelerate"\ 
"gradio"\ 
"onnx" "einops" "transformers_stream_generator" "tiktoken" "transformers>=4.40" "bitsandbytes" "faiss-cpu" "sentence_transformers" "langchain>=0.2.0" "langchain-community>=0.2.0" "langchainhub" "unstructured" "scikit-learn" "python-docx" "pypdf"

Note: you may need to restart the kernel to use updated packages. 
WARNING: typer 0.12.3 does not provide the extra 'all' 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
This behaviour is the source of the following dependency conflicts. 
llama-index-postprocessor-openvino-rerank 0.1.3 requires huggingface-hub<0.21.0,>=0.20.3, but you have huggingface-hub 0.23.4 which is incompatible. 
llama-index-llms-langchain 0.1.4 requires langchain<0.2.0,>=0.1.3, but you have langchain 0.2.6 which is incompatible. 
Note: you may need to restart the kernel to use updated packages.

import os 
from pathlib import Path 
import requests 
import shutil 
import io 

# fetch model configuration 

config_shared_path = Path("../../utils/llm_config.py") 
config_dst_path = Path("llm_config.py") 
text_example_en_path = Path("text_example_en.pdf") 
text_example_cn_path = Path("text_example_cn.pdf") 
text_example_en = "https://github.com/openvinotoolkit/openvino_notebooks/files/15039728/Platform.Brief_Intel.vPro.with.Intel.Core.Ultra_Final.pdf"
text_example_cn = "https://github.com/openvinotoolkit/openvino_notebooks/files/15039713/Platform.Brief_Intel.vPro.with.Intel.Core.Ultra_Final_CH.pdf"

if not config_dst_path.exists(): 
    if config_shared_path.exists(): 
        try: 
            os.symlink(config_shared_path, config_dst_path) 
        except Exception: 
            shutil.copy(config_shared_path, config_dst_path) 
    else: 
        r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/llm_config.py")
        with open("llm_config.py", "w", encoding="utf-8") as f: 
            f.write(r.text) 
elif not os.path.islink(config_dst_path): 
    print("LLM config will be updated") 
    if config_shared_path.exists(): 
        shutil.copy(config_shared_path, config_dst_path) 
    else: 
        r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/llm_config.py")
        with open("llm_config.py", "w", encoding="utf-8") as f: 
            f.write(r.text) 

if not text_example_en_path.exists(): 
    r = requests.get(url=text_example_en) 
    content = io.BytesIO(r.content) 
    with open("text_example_en.pdf", "wb") as f: 
        f.write(content.read()) 

if not text_example_cn_path.exists(): 
    r = requests.get(url=text_example_cn) 
    content = io.BytesIO(r.content) 
    with open("text_example_cn.pdf", "wb") as f: 
        f.write(content.read())

LLM config will be updated

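Select model for inference#

This tutorial supports different models; you can select one from the provided options to compare the quality of open-source LLM solutions.

Note: conversion of some models may require additional actions from the user and at least 64GB of RAM.

The available embedding model options are:

BGE embedding is a general embedding model. The model is pre-trained using RetroMAE and trained on large-scale pair data using contrastive learning.

The available rerank model options are:

A reranker model with a cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming. It can be used to re-rank the top-k documents returned by the embedding model.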
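The two-stage pattern described above (a fast first pass selects top-k candidates, then a slower but more accurate scorer reorders them) can be sketched without any model. Both scoring functions below are hypothetical stand-ins: `cheap_score` compares independently computed word sets, loosely mimicking a bi-encoder, while `pair_score` looks at the query and document together, loosely mimicking a cross-encoder.

```python
import re

def words(text: str) -> list[str]:
    """Split a string into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cheap_score(query: str, doc: str) -> float:
    """Stand-in for a bi-encoder: each side is reduced to a word set, then compared."""
    return float(len(set(words(query)) & set(words(doc))))

def pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: scores the pair jointly via shared word bigrams."""
    q, d = words(query), words(doc)
    return float(len(set(zip(q, q[1:])) & set(zip(d, d[1:]))))

def retrieve_then_rerank(query: str, docs: list[str], top_k: int, top_n: int) -> list[str]:
    """First keep the top_k docs by the cheap score, then keep the top_n by the pair score."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)[:top_n]

docs = [
    "power saving tips for the office",
    "consumption of power can save the processor",
    "the processor can help save power consumption",
    "weather report for tomorrow",
]
query = "how much power consumption can the processor save"
result = retrieve_then_rerank(query, docs, top_k=3, top_n=2)
print(result)
```

The expensive scorer only ever sees `top_k` documents, which is why reranking stays affordable even when the document store is large.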

You can also find the available LLM model options in the llm-chatbot notebook.

from pathlib import Path 
import openvino as ov 
import torch 
import ipywidgets as widgets 
from transformers import ( 
    TextIteratorStreamer, 
    StoppingCriteria, 
    StoppingCriteriaList, 
)

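Convert model and compress model weights#

The weight compression algorithm compresses the weights of a model and can be used to optimize the footprint and performance of large models in which the weights are relatively larger than the activations, for example Large Language Models (LLMs). Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.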
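A back-of-the-envelope estimate shows why the weight data type dominates the footprint. The sketch below assumes a hypothetical model with 7 billion parameters and ignores quantization scales, zero points, and layers kept at higher precision, so real files will be somewhat larger.

```python
def weight_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3

num_params = 7e9  # assumed parameter count, for illustration only
sizes = {
    name: weight_size_gib(num_params, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
for name, size in sizes.items():
    print(f"{name}: ~{size:.1f} GiB")
```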

from llm_config import ( 
    SUPPORTED_EMBEDDING_MODELS, 
    SUPPORTED_RERANK_MODELS, 
    SUPPORTED_LLM_MODELS, 
) 

model_languages = list(SUPPORTED_LLM_MODELS) 

model_language = widgets.Dropdown( 
    options=model_languages, 
    value=model_languages[0], 
    description="Model Language:", 
    disabled=False, 
) 

model_language

Dropdown(description='Model Language:', options=('English', 'Chinese', 'Japanese'), value='English')

llm_model_ids = [model_id for model_id, model_config in SUPPORTED_LLM_MODELS[model_language.value].items() if model_config.get("rag_prompt_template")] 

llm_model_id = widgets.Dropdown( 
    options=llm_model_ids, 
    value=llm_model_ids[-1], 
    description="Model:", 
    disabled=False, 
) 

llm_model_id

Dropdown(description='Model:', index=12, options=('tiny-llama-1b-chat', 'gemma-2b-it', 'red-pajama-3b-chat', '…

llm_model_configuration = SUPPORTED_LLM_MODELS[model_language.value][llm_model_id.value] 
print(f"Selected LLM model {llm_model_id.value}")

Selected LLM model neural-chat-7b-v3-1

Optimum Intel is the interface between the Transformers and Diffusers libraries and OpenVINO that accelerates end-to-end pipelines on Intel® architectures. It provides an easy-to-use CLI for exporting models to the OpenVINO Intermediate Representation (IR) format.

The command below shows the basic form of model export with optimum-cli:

optimum-cli export openvino --model <model_id_or_path> --task <task> <out_dir>

Here, --model is a model ID from the Hugging Face Hub or a local directory containing the model (saved using the .save_pretrained method), and --task is one of the supported tasks that the exported model should solve. For LLMs, it is text-generation-with-past. If model initialization requires remote code, pass the --trust-remote-code flag as well.
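When the command is assembled programmatically, building it from an argument list and quoting it with `shlex.join` avoids problems with spaces in paths. The following is a minimal sketch that uses only the flags described above; the helper name `build_export_command`, the model ID `some-org/some-model`, and the output directory are placeholders, not values from this tutorial.

```python
import shlex

def build_export_command(model_id: str, task: str, out_dir: str, trust_remote_code: bool = False) -> str:
    """Assemble an optimum-cli export command as a shell-safe string."""
    args = ["optimum-cli", "export", "openvino", "--model", model_id, "--task", task]
    if trust_remote_code:
        # Only needed when the model's initialization relies on remote code
        args.append("--trust-remote-code")
    args.append(out_dir)
    return shlex.join(args)

command = build_export_command(
    model_id="some-org/some-model",  # placeholder model ID
    task="text-generation-with-past",
    out_dir="some-model/FP16",  # placeholder output directory
    trust_remote_code=True,
)
print(command)
```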

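LLM conversion and weight compression with Optimum-CLI#

When exporting a model with the CLI, you can also apply fp16, 8-bit, or 4-bit weight compression to the Linear, Convolutional, and Embedding layers by setting --weight-format to fp16, int8, or int4, respectively. This type of optimization reduces the memory footprint and inference latency. By default, the int8/int4 quantization scheme is asymmetric; add --sym to make it symmetric.

For INT4 quantization you can also specify the following arguments:

The --group-size parameter defines the group size to use for quantization; -1 results in per-column quantization.
The --ratio parameter controls the ratio between 4-bit and 8-bit quantization. For example, 0.8 means that 80% of the layers are compressed to int4 and the remaining 20% to int8 precision.

Smaller group_size and ratio values usually improve accuracy at the cost of model size and inference latency.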
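The effect of the group size can be illustrated with a toy symmetric quantizer in plain Python. This is a simplified sketch, not the NNCF algorithm: it works on a flat list instead of a weight matrix, uses 7 positive levels as an assumed signed 4-bit range, and treats -1 as "one scale for the whole list". Each group stores its own scale, which is why smaller groups cost more memory but track the weights more closely.

```python
def quantize_dequantize(weights: list[float], group_size: int, levels: int = 7) -> list[float]:
    """Quantize each group with its own scale, then map the integers back to floats."""
    if group_size == -1:
        group_size = len(weights)  # simplification: a single scale for everything
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / levels if max_abs > 0 else 1.0
        restored.extend(round(w / scale) * scale for w in group)
    return restored

def mean_abs_error(a: list[float], b: list[float]) -> float:
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Made-up weights: four small values followed by four large ones
weights = [0.01, -0.02, 0.03, 0.015, 0.9, -0.8, 0.7, 0.85]
err_whole = mean_abs_error(weights, quantize_dequantize(weights, group_size=-1))
err_groups = mean_abs_error(weights, quantize_dequantize(weights, group_size=4))
print(f"one scale: {err_whole:.4f}, groups of 4: {err_groups:.4f}")
```

With a single scale, the large values dictate the step size and the small weights collapse to zero; with groups of four, the small weights get their own finer scale.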

Note: INT4/INT8 compressed models may not show a speedup on dGPU.

from IPython.display import Markdown, display 

prepare_int4_model = widgets.Checkbox( 
    value=True, 
    description="Prepare INT4 model", 
    disabled=False, 
) 
prepare_int8_model = widgets.Checkbox( 
    value=False, 
    description="Prepare INT8 model", 
    disabled=False, 
) 
prepare_fp16_model = widgets.Checkbox( 
    value=False, 
    description="Prepare FP16 model", 
    disabled=False, 
) 

display(prepare_int4_model) 
display(prepare_int8_model) 
display(prepare_fp16_model)

Checkbox(value=True, description='Prepare INT4 model')

Checkbox(value=False, description='Prepare INT8 model')

Checkbox(value=False, description='Prepare FP16 model')

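Weight compression with AWQ#

Activation-aware Weight Quantization (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves the generation quality of compressed LLMs, but requires significant additional time for tuning the weights on a calibration dataset. The wikitext-2-raw-v1/train subset of the Wikitext dataset is used for calibration.

Below you can enable AWQ to be additionally applied during model export with INT4 precision.

Note: applying AWQ requires significant memory and time.

Note: it is possible that no pattern in the model matches for applying AWQ, in which case it is skipped.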

enable_awq = widgets.Checkbox( 
    value=False, 
    description="Enable AWQ", 
    disabled=not prepare_int4_model.value, 
) 
display(enable_awq)

Checkbox(value=False, description='Enable AWQ')

pt_model_id = llm_model_configuration["model_id"] 
pt_model_name = llm_model_id.value.split("-")[0] 
fp16_model_dir = Path(llm_model_id.value) / "FP16" 
int8_model_dir = Path(llm_model_id.value) / "INT8_compressed_weights" 
int4_model_dir = Path(llm_model_id.value) / "INT4_compressed_weights" 

def convert_to_fp16(): 
    if (fp16_model_dir / "openvino_model.xml").exists(): 
        return 
    remote_code = llm_model_configuration.get("remote_code", False) 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format fp16".format(pt_model_id) 
    if remote_code: 
        export_command_base += " --trust-remote-code" 
    export_command = export_command_base + " " + str(fp16_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

def convert_to_int8(): 
    if (int8_model_dir / "openvino_model.xml").exists(): 
        return 
    int8_model_dir.mkdir(parents=True, exist_ok=True) 
    remote_code = llm_model_configuration.get("remote_code", False) 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format int8".format(pt_model_id) 
    if remote_code: 
        export_command_base += " --trust-remote-code" 
    export_command = export_command_base + " " + str(int8_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

def convert_to_int4(): 
    compression_configs = { 
        "zephyr-7b-beta": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "mistral-7b": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "minicpm-2b-dpo": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "gemma-2b-it": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "notus-7b-v1": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "neural-chat-7b-v3-1": { 
            "sym": True, "group_size": 64, 
            "ratio": 0.6, 
        }, 
        "llama-2-chat-7b": { 
            "sym": True, "group_size": 128, 
            "ratio": 0.8, 
        }, 
        "llama-3-8b-instruct": { 
            "sym": True, "group_size": 128, 
            "ratio": 0.8, 
        }, 
        "gemma-7b-it": { 
            "sym": True, "group_size": 128, 
            "ratio": 0.8, 
        }, 
        "chatglm2-6b": { 
            "sym": True, "group_size": 128, 
            "ratio": 0.72, 
        }, 
        "qwen-7b-chat": {"sym": True, "group_size": 128, 
            "ratio": 0.6}, 
        "red-pajama-3b-chat": { 
            "sym": False, 
            "group_size": 128, 
            "ratio": 0.5, 
        }, 
        "default": { 
            "sym": False, 
            "group_size": 128, 
            "ratio": 0.8, 
        }, 
    } 

    model_compression_params = compression_configs.get(llm_model_id.value, compression_configs["default"]) 
    if (int4_model_dir / "openvino_model.xml").exists(): 
        return 
    remote_code = llm_model_configuration.get("remote_code", False) 
    export_command_base = "optimum-cli export openvino --model {} --task text-generation-with-past --weight-format int4".format(pt_model_id) 
    int4_compression_args = " --group-size {} --ratio {}".format(model_compression_params["group_size"], model_compression_params["ratio"]) 
    if model_compression_params["sym"]: 
        int4_compression_args += " --sym"
    if enable_awq.value:
        int4_compression_args += " --awq --dataset wikitext2 --num-samples 128"
    export_command_base += int4_compression_args 
    if remote_code: 
        export_command_base += " --trust-remote-code" 
    export_command = export_command_base + " " + str(int4_model_dir) 
    display(Markdown("**Export command:**")) 
    display(Markdown(f"`{export_command}`")) 
    ! $export_command 

if prepare_fp16_model.value: 
    convert_to_fp16() 
if prepare_int8_model.value: 
    convert_to_int8() 
if prepare_int4_model.value: 
    convert_to_int4()

Let's compare the model size for the different compression types.

fp16_weights = fp16_model_dir / "openvino_model.bin" 
int8_weights = int8_model_dir / "openvino_model.bin" 
int4_weights = int4_model_dir / "openvino_model.bin" 

if fp16_weights.exists(): 
    print(f"Size of FP16 model is {fp16_weights.stat().st_size / 1024 / 1024:.2f} MB") 
for precision, compressed_weights in zip([8, 4], [int8_weights, int4_weights]): 
    if compressed_weights.exists(): 
        print(f"Size of model with INT{precision} compressed weights is {compressed_weights.stat().st_size / 1024 / 1024:.2f} MB") 
    if compressed_weights.exists() and fp16_weights.exists(): 
        print(f"Compression rate for INT{precision} model: {fp16_weights.stat().st_size / compressed_weights.stat().st_size:.3f}")

Size of model with INT4 compressed weights is 5069.90 MB

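Convert embedding model using Optimum-CLI#

Since some embedding models support only a limited set of languages, the options are filtered according to the language selected for the LLM.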

embedding_model_id = list(SUPPORTED_EMBEDDING_MODELS[model_language.value]) 

embedding_model_id = widgets.Dropdown( 
    options=embedding_model_id, 
    value=embedding_model_id[0], 
    description="Embedding Model:", 
    disabled=False, 
) 

embedding_model_id

Dropdown(description='Embedding Model:', options=('bge-small-en-v1.5', 'bge-large-en-v1.5'), value='bge-small-…

embedding_model_configuration = SUPPORTED_EMBEDDING_MODELS[model_language.value][embedding_model_id.value] 
print(f"Selected {embedding_model_id.value} model")

Selected bge-small-en-v1.5 model

The OpenVINO embedding model and tokenizer can be exported with the feature-extraction task using optimum-cli.

export_command_base = "optimum-cli export openvino --model {} --task feature-extraction".format(embedding_model_configuration["model_id"]) 
export_command = export_command_base + " " + str(embedding_model_id.value) 

if not Path(embedding_model_id.value).exists(): 
    ! $export_command

Convert rerank model using Optimum-CLI#

rerank_model_id = list(SUPPORTED_RERANK_MODELS) 

rerank_model_id = widgets.Dropdown( 
    options=rerank_model_id, 
    value=rerank_model_id[0], 
    description="Rerank Model:", 
    disabled=False, 
) 

rerank_model_id

Dropdown(description='Rerank Model:', options=('bge-reranker-large', 'bge-reranker-base'), value='bge-reranker…

rerank_model_configuration = SUPPORTED_RERANK_MODELS[rerank_model_id.value] 
print(f"Selected {rerank_model_id.value} model")

Selected bge-reranker-large model

Since reranking is a kind of sentence classification task, the rerank model's OpenVINO IR and tokenizer can be exported with the text-classification task using optimum-cli.

export_command_base = "optimum-cli export openvino --model {} --task text-classification".format(rerank_model_configuration["model_id"]) 
export_command = export_command_base + " " + str(rerank_model_id.value) 

if not Path(rerank_model_id.value).exists(): 
    ! $export_command

Select device for inference and model variant#

Note: INT4/INT8 compressed models may not show a speedup on dGPU.

Select device for embedding model inference#

core = ov.Core() 

support_devices = core.available_devices 

embedding_device = widgets.Dropdown( 
    options=support_devices + ["AUTO"], 
    value="CPU", 
    description="Device:", 
    disabled=False, 
) 

embedding_device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

print(f"Embedding model will be loaded to {embedding_device.value} device for text embedding")

Embedding model will be loaded to CPU device for text embedding

When the model is loaded to an NPU device, optimize the parameter precision of the BGE embedding model.

USING_NPU = embedding_device.value == "NPU" 

npu_embedding_dir = embedding_model_id.value + "-npu" 
npu_embedding_path = Path(npu_embedding_dir) / "openvino_model.xml" 
if USING_NPU and not Path(npu_embedding_dir).exists(): 
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    with open("notebook_utils.py", "w") as f: 
        f.write(r.text) 
    import notebook_utils as utils 

    shutil.copytree(embedding_model_id.value, npu_embedding_dir) 
    utils.optimize_bge_embedding(Path(embedding_model_id.value) / "openvino_model.xml", npu_embedding_path)

Select device for rerank model inference#

rerank_device = widgets.Dropdown( 
    options=support_devices + ["AUTO"], 
    value="CPU", 
    description="Device:", 
    disabled=False, 
) 

rerank_device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

print(f"Rerank model will be loaded to {rerank_device.value} device for text reranking")

Rerank model will be loaded to CPU device for text reranking

Select device for LLM model inference#

llm_device = widgets.Dropdown( 
    options=support_devices + ["AUTO"], 
    value="CPU", 
    description="Device:", 
    disabled=False, 
) 

llm_device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

print(f"LLM model will be loaded to {llm_device.value} device for response generation")

LLM model will be loaded to CPU device for response generation

Load models#

Load embedding model#

Hugging Face embedding models can now be run with OpenVINO through LangChain's OpenVINOEmbeddings and OpenVINOBgeEmbeddings classes.

from langchain_community.embeddings import OpenVINOBgeEmbeddings 

embedding_model_name = npu_embedding_dir if USING_NPU else embedding_model_id.value 
batch_size = 1 if USING_NPU else 4 
embedding_model_kwargs = {"device": embedding_device.value, "compile": False} 
encode_kwargs = { 
    "mean_pooling": embedding_model_configuration["mean_pooling"], 
    "normalize_embeddings": embedding_model_configuration["normalize_embeddings"], 
    "batch_size": batch_size, 
} 

embedding = OpenVINOBgeEmbeddings( 
    model_name_or_path=embedding_model_name, 
    model_kwargs=embedding_model_kwargs, 
    encode_kwargs=encode_kwargs, 
) 
if USING_NPU: 
    embedding.ov_model.reshape(1, 512) 
embedding.ov_model.compile() 

text = "This is a test document." 
embedding_result = embedding.embed_query(text) 
embedding_result[:3]

Compiling the model to CPU ...

[-0.04208654910326004, 0.06681869924068451, 0.007916687056422234]
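The `mean_pooling` and `normalize_embeddings` options passed in `encode_kwargs` refer to two common post-processing steps for sentence embeddings. The sketch below shows the generic meaning of those steps in plain Python; whether OpenVINOBgeEmbeddings implements them exactly this way is an assumption, and the token vectors are made up.

```python
import math

def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Three made-up 2-dimensional token vectors; the last position is padding
token_vectors = [[3.0, 0.0], [3.0, 8.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Normalizing embeddings to unit length lets a vector store rank documents with a plain inner product.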

Load rerank model#

Hugging Face rerank models can now be run with OpenVINO through LangChain's OpenVINOReranker class.

Note: reranking can be skipped in RAG.

from langchain_community.document_compressors.openvino_rerank import OpenVINOReranker 

rerank_model_name = rerank_model_id.value 
rerank_model_kwargs = {"device": rerank_device.value} 
rerank_top_n = 2 

reranker = OpenVINOReranker( 
    model_name_or_path=rerank_model_name, 
    model_kwargs=rerank_model_kwargs, 
    top_n=rerank_top_n, 
)

Compiling the model to CPU ...

Load LLM model#

OpenVINO models can be run locally through LangChain's HuggingFacePipeline class. To deploy a model with OpenVINO, specify the backend="openvino" parameter so that OpenVINO is used as the backend inference framework.

available_models = [] 
if int4_model_dir.exists(): 
    available_models.append("INT4") 
if int8_model_dir.exists(): 
    available_models.append("INT8") 
if fp16_model_dir.exists(): 
    available_models.append("FP16") 

model_to_run = widgets.Dropdown( 
    options=available_models, 
    value=available_models[0], 
    description="Model to run:", 
    disabled=False, 
) 

model_to_run

Dropdown(description='Model to run:', options=('INT4',), value='INT4')

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline 

if model_to_run.value == "INT4": 
    model_dir = int4_model_dir 
elif model_to_run.value == "INT8": 
    model_dir = int8_model_dir 
else: 
    model_dir = fp16_model_dir 
print(f"Loading model from {model_dir}") 

ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""} 

if "GPU" in llm_device.value and "qwen2-7b-instruct" in llm_model_id.value: 
    ov_config["GPU_ENABLE_SDPA_OPTIMIZATION"] = "NO" 

# On a GPU device the model is executed in FP16 precision. For the red-pajama-3b-chat model this is known to cause
# accuracy issues, which can be avoided by setting the precision hint to "f32"
if llm_model_id.value == "red-pajama-3b-chat" and "GPU" in core.available_devices and llm_device.value in ["GPU", "AUTO"]: 
    ov_config["INFERENCE_PRECISION_HINT"] = "f32" 

llm = HuggingFacePipeline.from_model_id( 
    model_id=str(model_dir), 
    task="text-generation", 
    backend="openvino", 
    model_kwargs={ 
        "device": llm_device.value, 
        "ov_config": ov_config, 
        "trust_remote_code": True, 
    }, 
    pipeline_kwargs={"max_new_tokens": 2}, 
) 

llm.invoke("2 + 2 =")

The argument trust_remote_code is to be used along with export=True.It will be ignored.

Loading model from neural-chat-7b-v3-1/INT4_compressed_weights

Compiling the model to CPU ...

'2 + 2 = 4'

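Run QA over Document#

Once the models are created, we can set up a chatbot interface using Gradio.

A typical RAG application has two main components:

Indexing: a pipeline for ingesting data from a source and indexing it. This usually happens offline.
Retrieval and generation: the actual RAG chain, which takes the user query at run time, retrieves the relevant data from the index, and then passes it to the model.

The most common full sequence from raw data to answer looks like this:

Indexing

Load: first we need to load the data. We use DocumentLoaders for this.
Split: text splitters break large documents into smaller chunks. This is useful both for indexing the data and for passing it to a model, since large chunks are harder to search over and do not fit in a model's finite context window.
Store: we need somewhere to store and index the splits so that they can be searched later. This is often done using a VectorStore and an Embeddings model.

Retrieval and generation

Retrieve: given a user input, the relevant splits are retrieved from storage using a Retriever.
Generate: an LLM produces an answer using a prompt that includes the question and the retrieved data.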
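The Split step is governed by two numbers used later in this notebook, the chunk size and the chunk overlap. The following sketch shows the idea with a fixed-width character window; it is a simplification, since LangChain's splitters also try to break on separators such as paragraphs and sentences, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the price of storing some text twice.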

import re 
from typing import List 
from langchain.text_splitter import ( 
    CharacterTextSplitter, 
    RecursiveCharacterTextSplitter, 
    MarkdownTextSplitter, 
) 
from langchain.document_loaders import ( 
    CSVLoader, 
    EverNoteLoader, 
    PyPDFLoader, 
    TextLoader, 
    UnstructuredEPubLoader, 
    UnstructuredHTMLLoader, 
    UnstructuredMarkdownLoader, 
    UnstructuredODTLoader, 
    UnstructuredPowerPointLoader, 
    UnstructuredWordDocumentLoader,
) 

class ChineseTextSplitter(CharacterTextSplitter): 
    def __init__(self, pdf: bool = False, **kwargs): 
        super().__init__(**kwargs) 
        self.pdf = pdf 

    def split_text(self, text: str) -> List[str]: 
        if self.pdf: 
            text = re.sub(r"\n{3,}", "\n", text) 
            text = text.replace("\n\n", "") 
        sent_sep_pattern = re.compile('([﹒﹔﹖﹗．。！？]["’”」』]{0,2}|(?=["‘“「『]{1,2}|$))') 
        sent_list = [] 
        for ele in sent_sep_pattern.split(text): 
            if sent_sep_pattern.match(ele) and sent_list: 
                sent_list[-1] += ele 
            elif ele: 
                sent_list.append(ele) 
        return sent_list 

TEXT_SPLITERS = { 
    "Character": CharacterTextSplitter, 
    "RecursiveCharacter": RecursiveCharacterTextSplitter, 
    "Markdown": MarkdownTextSplitter, 
    "Chinese": ChineseTextSplitter, 
} 

LOADERS = { 
    ".csv": (CSVLoader, {}), 
    ".doc": (UnstructuredWordDocumentLoader, {}), 
    ".docx": (UnstructuredWordDocumentLoader, {}), 
    ".enex": (EverNoteLoader, {}), 
    ".epub": (UnstructuredEPubLoader, {}), 
    ".html": (UnstructuredHTMLLoader, {}), 
    ".md": (UnstructuredMarkdownLoader, {}), 
    ".odt": (UnstructuredODTLoader, {}), 
    ".pdf": (PyPDFLoader, {}), 
    ".ppt": (UnstructuredPowerPointLoader, {}), 
    ".pptx": (UnstructuredPowerPointLoader, {}), 
    ".txt": (TextLoader, {"encoding": "utf8"}), 
} 

chinese_examples = [ 
    ["英特尔®酷睿™ Ultra处理器可以降低多少功耗？"], 
    ["相比英特尔之前的移动处理器产品，英特尔®酷睿™ Ultra处理器的AI推理性能提升了多少？"], 
    ["英特尔博锐® Enterprise系统提供哪些功能？"], 
] 

english_examples = [ 
    ["How much power consumption can Intel® Core™ Ultra Processors help save?"], 
    ["Compared to Intel’s previous mobile processor, what is the advantage of Intel® Core™ Ultra Processors for Artificial Intelligence?"], 
    ["What can Intel vPro® Enterprise systems offer?"], 
] 

if model_language.value == "English": 
    text_example_path = "text_example_en.pdf" 
else: 
    text_example_path = "text_example_cn.pdf" 

examples = chinese_examples if (model_language.value == "Chinese") else english_examples

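You can build a LangChain RAG pipeline with create_retrieval_chain, which creates a chain that connects the RAG components: the retriever and the chain that combines the retrieved documents with the prompt for the LLM (created with create_stuff_documents_chain).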

from langchain.prompts import PromptTemplate 
from langchain_community.vectorstores import FAISS 
from langchain.chains.retrieval import create_retrieval_chain 
from langchain.chains.combine_documents import create_stuff_documents_chain 
from langchain.docstore.document import Document 
from langchain.retrievers import ContextualCompressionRetriever 
from threading import Thread 
import gradio as gr 

stop_tokens = llm_model_configuration.get("stop_tokens") 
rag_prompt_template = llm_model_configuration["rag_prompt_template"] 

class StopOnTokens(StoppingCriteria): 
    def __init__(self, token_ids): 
        self.token_ids = token_ids 

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: 
        for stop_id in self.token_ids: 
            if input_ids[0][-1] == stop_id: 
                return True 
        return False 

if stop_tokens is not None: 
    if isinstance(stop_tokens[0], str): 
        stop_tokens = llm.pipeline.tokenizer.convert_tokens_to_ids(stop_tokens) 

    stop_tokens = [StopOnTokens(stop_tokens)] 

def load_single_document(file_path: str) -> List[Document]: 
    """ 
    helper for loading a single document 

    Params: 
        file_path: document path 
    Returns: 
        documents loaded
 
    """ 
    ext = "."+ file_path.rsplit(".", 1)[-1] 
    if ext in LOADERS: 
        loader_class, loader_args = LOADERS[ext] 
        loader = loader_class(file_path, **loader_args) 
        return loader.load() 

    raise ValueError(f"Unsupported file extension '{ext}'") 

def default_partial_text_processor(partial_text: str, new_text: str): 
    """ 
    helper for updating partially generated answer, used by default 

    Params: 
        partial_text: text buffer for storing previously generated text
        new_text: text update for the current step 
    Returns: 
        updated text string 

    """ 
    partial_text += new_text 
    return partial_text 

text_processor = llm_model_configuration.get("partial_text_processor", default_partial_text_processor) 

def create_vectordb( 
    docs, spliter_name, chunk_size, chunk_overlap, vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold, progress=gr.Progress() 
): 
    """ 
    Initialize a vector database 

    Params: 
        docs: original documents provided by user 
        spliter_name: spliter method 
        chunk_size: size of a single sentence chunk 
        chunk_overlap: overlap size between 2 chunks 
        vector_search_top_k: Vector search top k 
        vector_rerank_top_n: Search rerank top n 
        run_rerank: whether run reranker 
        search_method: top k search method 
        score_threshold: score threshold when selecting 'similarity_score_threshold' method
 
    """ 
    global db 
    global retriever 
    global combine_docs_chain 
    global rag_chain 

    if vector_rerank_top_n > vector_search_top_k: 
        gr.Warning("Search top k must >= Rerank top n") 

    documents = [] 
    for doc in docs: 
        if type(doc) is not str: 
            doc = doc.name 
        documents.extend(load_single_document(doc)) 

    text_splitter = TEXT_SPLITERS[spliter_name](chunk_size=chunk_size, chunk_overlap=chunk_overlap) 

    texts = text_splitter.split_documents(documents) 
    db = FAISS.from_documents(texts, embedding) 
    if search_method == "similarity_score_threshold": 
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold} 
    else: 
        search_kwargs = {"k": vector_search_top_k} 
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method) 
    if run_rerank: 
        reranker.top_n = vector_rerank_top_n 
        retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever) 
    prompt = PromptTemplate.from_template(rag_prompt_template) 
    combine_docs_chain = create_stuff_documents_chain(llm, prompt) 

    rag_chain = create_retrieval_chain(retriever, combine_docs_chain) 

    return "Vector database is Ready" 

def update_retriever(vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold): 
    """ 
    Update retriever 

    Params: 
        vector_search_top_k: Vector search top k 
        vector_rerank_top_n: Search rerank top n 
        run_rerank: whether run reranker 
        search_method: top k search method 
        score_threshold: score threshold when selecting 'similarity_score_threshold' method 

    """ 
    global db 
    global retriever 
    global combine_docs_chain 
    global rag_chain 

    if vector_rerank_top_n > vector_search_top_k: 
        gr.Warning("Search top k must >= Rerank top n") 

    if search_method == "similarity_score_threshold": 
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold} 
    else: 
        search_kwargs = {"k": vector_search_top_k} 
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method) 
    if run_rerank: 
        retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever) 
        reranker.top_n = vector_rerank_top_n 
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain) 

    return "Vector database is Ready" 

def user(message, history): 
    """ 
    callback function for updating user messages in interface on submit button click 

    Params: 
        message: current message 
        history: conversation history 
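    Returns: 
        an empty string and the conversation history extended with the new message 
    """ 
    # Append the user's message to the conversation history 
    return "", history + [[message, ""]] 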

def bot(history, temperature, top_p, top_k, repetition_penalty, hide_full_prompt, do_rag): 
    """ 
    callback function for running chatbot on submit button click 

    Params: 
        history: conversation history 
        temperature: parameter for controlling the level of creativity in AI-generated text.
                    By adjusting the `temperature`, you can influence the AI model's probability distribution, making the text more focused or diverse. 
        top_p: parameter for controlling the range of tokens considered by the AI model based on their cumulative probability. 
        top_k: parameter for controlling the range of tokens considered by the AI model by keeping only the given number of tokens with the highest probability. 
        repetition_penalty: parameter for penalizing tokens based on how frequently they occur in the text. 
        hide_full_prompt: whether to skip the prompt in the streamed output (passed to the streamer as skip_prompt). 
        do_rag: whether to answer through the RAG chain; otherwise the LLM is invoked directly with an empty context.
    """ 
    streamer = TextIteratorStreamer( 
        llm.pipeline.tokenizer, 
        timeout=60.0, 
        skip_prompt=hide_full_prompt, 
        skip_special_tokens=True, 
    ) 
    llm.pipeline._forward_params = dict( 
        max_new_tokens=512, 
        temperature=temperature, 
        do_sample=temperature > 0.0, 
        top_p=top_p, 
        top_k=top_k, 
        repetition_penalty=repetition_penalty, 
        streamer=streamer, 
    ) 
    if stop_tokens is not None: 
        llm.pipeline._forward_params["stopping_criteria"] = StoppingCriteriaList(stop_tokens) 

    if do_rag: 
        t1 = Thread(target=rag_chain.invoke, args=({"input": history[-1][0]},)) 
    else: 
        input_text = rag_prompt_template.format(input=history[-1][0], context="") 
        t1 = Thread(target=llm.invoke, args=(input_text,)) 
    t1.start() 

    # Initialize an empty string to accumulate the generated text
    partial_text = "" 
    for new_text in streamer: 
        partial_text = text_processor(partial_text, new_text) 
        history[-1][1] = partial_text 
        yield history 

def request_cancel(): 
    llm.pipeline.model.request.cancel() 

def clear_files(): 
    return "Vector Store is Not ready" 

# Initialize the vector store with the sample document
create_vectordb( 
    [text_example_path], 
    "RecursiveCharacter", 
    chunk_size=400, 
    chunk_overlap=50, 
    vector_search_top_k=10, 
    vector_rerank_top_n=2, 
    run_rerank=True, 
    search_method="similarity_score_threshold", 
    score_threshold=0.5, 
)

'Vector database is Ready'
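
The numeric arguments passed to `create_vectordb` above interact in two places: chunking (`chunk_size`, `chunk_overlap`) and retrieval (`score_threshold`, `vector_search_top_k`, `vector_rerank_top_n`). The following is a minimal, library-free sketch of that interaction. The names `split_with_overlap`, `select_chunks`, and `rerank_fn` are hypothetical, the splitter is a plain sliding window rather than LangChain's separator-aware `RecursiveCharacter` splitter, and the exact filtering order inside `create_vectordb` is an assumption.

```python
def split_with_overlap(text, chunk_size=400, chunk_overlap=50):
    """Cut text into fixed-size character windows that share chunk_overlap characters.

    Simplified illustration only: the real splitter also respects separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # distance between window starts
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def select_chunks(scored_chunks, score_threshold=0.5, search_top_k=10,
                  rerank_fn=None, rerank_top_n=2):
    """Two-stage selection over (text, similarity) pairs.

    Stage 1 (assumed): drop chunks below score_threshold, keep the best search_top_k.
    Stage 2 (optional): re-score survivors with rerank_fn and keep rerank_top_n.
    """
    candidates = sorted(
        (pair for pair in scored_chunks if pair[1] >= score_threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )[:search_top_k]
    if rerank_fn is None:
        return [text for text, _ in candidates]
    reranked = sorted(candidates, key=lambda pair: rerank_fn(pair[0]), reverse=True)
    return [text for text, _ in reranked[:rerank_top_n]]


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
    scored = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.6)]
    toy_rerank_scores = {"a": 0.1, "c": 0.8, "d": 0.9}  # made-up reranker output
    print(select_chunks(scored, 0.5, 3, toy_rerank_scores.get, 2))
```

Because reranking can only reorder what the first stage returned, the search top-k value needs to be at least as large as the rerank top-n value for the second stage to have enough candidates.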

Next, create the Gradio UI and run the demo.

with gr.Blocks( 
    theme=gr.themes.Soft(), 
    css=".disclaimer {font-variant-caps: all-small-caps;}", 
) as demo: 
    gr.Markdown("""<h1><center>QA over Document</center></h1>""") 
    gr.Markdown(f"""<center>Powered by OpenVINO and {llm_model_id.value} </center>""") 
    with gr.Row(): 
        with gr.Column(scale=1): 
            docs = gr.File( 
                label="Step 1: Load text files", 
                value=[text_example_path], 
                file_count="multiple", 
                file_types=[ 
                    ".csv", 
                    ".doc", 
                    ".docx", 
                    ".enex", 
                    ".epub", 
                    ".html", 
                    ".md", 
                    ".odt", 
                    ".pdf", 
                    ".ppt", 
                    ".pptx", 
                    ".txt", 
                ], 
            ) 
            load_docs = gr.Button("Step 2: Build Vector Store", variant="primary") 
            db_argument = gr.Accordion("Vector Store Configuration", open=False) 
            with db_argument: 
                spliter = gr.Dropdown( 
                    ["Character", "RecursiveCharacter", "Markdown", "Chinese"], 
                    value="RecursiveCharacter", 
                    label="Text Splitter",
                    info="Method used to split the documents",
                    multiselect=False, 
                ) 

                chunk_size = gr.Slider( 
                    label="Chunk size", 
                    value=400, 
                    minimum=50, 
                    maximum=2000, 
                    step=50, 
                    interactive=True, 
                    info="Size of sentence chunk", 
                ) 

                chunk_overlap = gr.Slider( 
                    label="Chunk overlap", 
                    value=50, 
                    minimum=0, 
                    maximum=400, 
                    step=10, 
                    interactive=True, 
                    info=("Overlap between 2 chunks"), 
                ) 
            langchain_status = gr.Textbox( 
                label="Vector Store Status", 
                value="Vector Store is Ready", 
                interactive=False, 
            ) 
            do_rag = gr.Checkbox( 
                value=True, 
                label="RAG is ON", 
                interactive=True, 
                info="Whether to do RAG for generation", 
            ) 
            with gr.Accordion("Generation Configuration", open=False): 
                with gr.Row(): 
                    with gr.Column(): 
                        with gr.Row(): 
                            temperature = gr.Slider( 
                                label="Temperature", 
                                value=0.1, 
                                minimum=0.0, 
                                maximum=1.0, 
                                step=0.1, 
                                interactive=True, 
                                info="Higher values produce more diverse outputs", 
                            ) 
                    with gr.Column(): 
                        with gr.Row(): 
                            top_p = gr.Slider( 
                                label="Top-p (nucleus sampling)", 
                                value=1.0, 
                                minimum=0.0, 
                                maximum=1, 
                                step=0.01, 
                                interactive=True, 
                                info=(
                                    "Sample from the smallest possible set of tokens whose cumulative probability "
                                    "exceeds top_p. Set to 1 to disable and sample from all tokens."
                                ),
                            ) 
                with gr.Column(): 
                    with gr.Row(): 
                        top_k = gr.Slider( 
                            label="Top-k", 
                            value=50, 
                            minimum=0.0, 
                            maximum=200, 
                            step=1, 
                            interactive=True, 
                            info="Sample from a shortlist of top-k tokens — 0 to disable and sample from all tokens.", 
                        ) 
                with gr.Column(): 
                    with gr.Row(): 
                        repetition_penalty = gr.Slider( 
                            label="Repetition Penalty", 
                            value=1.1, 
                            minimum=1.0, 
                            maximum=2.0, 
                            step=0.1, 
                            interactive=True, 
                            info="Penalize repetition — 1.0 to disable.", 
                        ) 
        with gr.Column(scale=4): 
            chatbot = gr.Chatbot( 
                height=800, 
                label="Step 3: Input Query", 
            ) 
            with gr.Row(): 
                with gr.Column(): 
                    with gr.Row(): 
                        msg = gr.Textbox( 
                            label="QA Message Box", 
                            placeholder="Chat Message Box", 
                            show_label=False, 
                            container=False, 
                        ) 
                with gr.Column(): 
                    with gr.Row(): 
                        submit = gr.Button("Submit", variant="primary") 
                        stop = gr.Button("Stop") 
                        clear = gr.Button("Clear") 
        gr.Examples(examples, inputs=msg, label="Click on any example and press the 'Submit' button") 
        retriever_argument = gr.Accordion("Retriever Configuration", open=True) 
        with retriever_argument: 
            with gr.Row(): 
                with gr.Row(): 
                    do_rerank = gr.Checkbox( 
                        value=True, 
                        label="Rerank searching result", 
                        interactive=True, 
                    ) 
                    hide_context = gr.Checkbox( 
                        value=True, 
                        label="Hide searching result in prompt", 
                        interactive=True, 
                    ) 
                with gr.Row(): 
                    search_method = gr.Dropdown( 
                        ["similarity_score_threshold", "similarity", "mmr"], 
                        value="similarity_score_threshold", 
                        label="Searching Method", 
                        info="Method used to search vector store", 
                        multiselect=False, 
                        interactive=True, 
                    ) 
                with gr.Row(): 
                    score_threshold = gr.Slider( 
                        0.01, 
                        0.99, 
                        value=0.5, 
                        step=0.01, 
                        label="Similarity Threshold", 
                        info="Only applies to the 'similarity_score_threshold' method",
                        interactive=True, 
                    ) 
                with gr.Row(): 
                    vector_rerank_top_n = gr.Slider( 
                        1, 
                        10, 
                        value=2, 
                        step=1, 
                        label="Rerank top n", 
                        info="Number of rerank results", 
                        interactive=True, 
                    ) 
                with gr.Row(): 
                    vector_search_top_k = gr.Slider( 
                        1, 
                        50, 
                        value=10, 
                        step=1, 
                        label="Search top k", 
                        info="Search top k must be >= Rerank top n",
                        interactive=True, 
                    ) 
    docs.clear(clear_files, outputs=[langchain_status], queue=False) 
    load_docs.click( 
        create_vectordb, 
        inputs=[docs, spliter, chunk_size, chunk_overlap, vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold], 
        outputs=[langchain_status], 
        queue=False, 
    )
    submit_event = msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot,
        [chatbot, temperature, top_p, top_k, repetition_penalty, hide_context, do_rag],
        chatbot,
        queue=True,
    )
    submit_click_event = submit.click(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot,
        [chatbot, temperature, top_p, top_k, repetition_penalty, hide_context, do_rag],
        chatbot,
        queue=True,
    )
    stop.click(
        fn=request_cancel,
        inputs=None,
        outputs=None,
        cancels=[submit_event, submit_click_event],
        queue=False,
    )
    clear.click(lambda: None, None, chatbot, queue=False)
    vector_search_top_k.release(
        update_retriever,
        [vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold],
        outputs=[langchain_status],
    )
    vector_rerank_top_n.release(
        update_retriever,
        inputs=[vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold],
        outputs=[langchain_status],
    )
    do_rerank.change(
        update_retriever,
        inputs=[vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold],
        outputs=[langchain_status],
    )
    search_method.change(
        update_retriever,
        inputs=[vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold],
        outputs=[langchain_status],
    )
    score_threshold.change(
        update_retriever,
        inputs=[vector_search_top_k, vector_rerank_top_n, do_rerank, search_method, score_threshold],
        outputs=[langchain_status],
    )

demo.queue() 
# If you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# If you have trouble launching on your platform, you can pass share=True to the launch method:
# demo.launch(share=True)
# This creates a publicly shareable link for the interface. See the docs for details: https://gradio.app/docs/
demo.launch()
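
The submit wiring in the cell above chains `user` and then `bot`, and `bot` is a generator: generation runs in a background thread while the callback iterates a streamer and yields the updated history after every new piece of text. The sketch below reproduces only that control flow with the standard library. `QueueStreamer` and `fake_generate` are hypothetical stand-ins for `TextIteratorStreamer` and the LLM call, and plain concatenation replaces the notebook's `text_processor`.

```python
import queue
import threading


class QueueStreamer:
    """Hypothetical stand-in for a text streamer: a producer thread puts
    pieces of text, and the consumer iterates until the end marker arrives."""

    _END = object()

    def __init__(self, timeout=5.0):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        # Raises queue.Empty if the producer stalls longer than the timeout
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(prompt, streamer):
    """Made-up 'model' that echoes the prompt in three pieces."""
    for piece in ["Echo:", " ", prompt]:
        streamer.put(piece)
    streamer.end()


def user(message, history):
    # Clear the message box and append the new turn with an empty reply
    return "", history + [[message, ""]]


def bot(history):
    streamer = QueueStreamer()
    worker = threading.Thread(target=fake_generate, args=(history[-1][0], streamer))
    worker.start()

    partial_text = ""
    for new_text in streamer:
        partial_text += new_text
        history[-1][1] = partial_text
        yield history  # each yield corresponds to one UI refresh
    worker.join()


_, history = user("hello", [])
updates = [step[-1][1] for step in bot(history)]
print(updates)
print(history)
```

Running generation in a separate thread is what lets the callback keep yielding partial results instead of blocking until the full answer is ready.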

# Run this cell to stop the Gradio interface
demo.close()