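Visual-language assistant with Video-LLaVA and OpenVINO#

This Jupyter notebook can be launched only after a local installation.

Video-LLaVA (Learning United Visual Representation by Alignment Before Projection, paper) is a Large Vision-Language Model (LVLM) that breaks new ground by understanding both images and videos through a single unified visual representation. While LLaVA excels at image-based tasks, Video-LLaVA extends this to the dynamic world of video, enabling seamless comprehension and reasoning across both visual domains. As a result, it can answer questions, generate text, and perform other tasks with equal ease, whether the input is a still image or a video.

This tutorial shows how to build a multimodal chatbot with the Video-LLaVA model. For demonstration purposes, the Video-LLaVA-7B model is used for conversion.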

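This tutorial consists of the following steps:

Install prerequisites
Prepare the input processor and tokenizer
Download the original model
Compress the model weights to 4 and 8 bits using NNCF
Convert the model to OpenVINO Intermediate Representation (IR) format
Prepare the OpenVINO-based inference pipeline
Run the OpenVINO model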

Table of contents:

About model
Prerequisites
Build model and convert it to OpenVINO IR format
- Prepare helpers for model conversion
- Convert and optimize model
  - Instantiate PyTorch model
  - Compress model weights to 4 and 8 bits using NNCF
  - Convert model to OpenVINO IR format
Prepare OpenVINO based inference pipeline
Run model inference
Interactive demo

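About model#

Video-LLaVA connects a pre-trained CLIP ViT-L/14 visual encoder and a large language model using a simple projection matrix.

For more details about the model, refer to the original paper and repository.

Prerequisites#

Install the required dependencies.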

%pip install -q torch "torchvision<0.17.0" "transformers>=4.31.0,<4.35.0" "pytorchvideo" "einops" "peft==0.6.2" --extra-index-url https://download.pytorch.org/whl/cpu 
%pip install -q opencv_python decord sentencepiece protobuf "openvino>=2023.2.0" "nncf>=2.7.0" "gradio>=4.19"

from pathlib import Path 
import sys 

repo_dir = Path("Video-LLaVA") 

if not repo_dir.exists():
     !git clone https://github.com/PKU-YuanGroup/Video-LLaVA.git 

sys.path.insert(0, str(repo_dir.resolve()))

Warning: This tutorial requires the ffmpeg package. To install it on your system, visit the official FFmpeg download page.

import gc 

import transformers 
from videollava.model import LlavaLlamaForCausalLM 
from videollava.constants import ( 
    DEFAULT_IMAGE_PATCH_TOKEN, 
    DEFAULT_VIDEO_PATCH_TOKEN, 
    DEFAULT_IM_START_TOKEN, 
    DEFAULT_VID_START_TOKEN, 
    DEFAULT_IM_END_TOKEN, 
    DEFAULT_VID_END_TOKEN, 
    DEFAULT_IMAGE_TOKEN, 
) 

transformers.logging.set_verbosity_error() 

model_id = "LanguageBind/Video-LLaVA-7B" 

config = transformers.AutoConfig.from_pretrained(model_id) 

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id) 
model = LlavaLlamaForCausalLM.from_pretrained(model_id) 
image_tower = model.get_image_tower() 
video_tower = model.get_video_tower() 
image_tower.load_model()
video_tower.load_model()
image_processor = image_tower.image_processor 
video_processor = video_tower.video_processor 
mm_use_im_start_end = getattr(config, "mm_use_im_start_end", False) 
mm_use_im_patch_token = getattr(config, "mm_use_im_patch_token", True) 
if mm_use_im_patch_token: 
    tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) 
    tokenizer.add_tokens([DEFAULT_VIDEO_PATCH_TOKEN], special_tokens=True) 
if mm_use_im_start_end: 
    tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) 
    tokenizer.add_tokens([DEFAULT_VID_START_TOKEN, DEFAULT_VID_END_TOKEN], special_tokens=True) 
preprocess_fn = model.prepare_inputs_labels_for_multimodal 

del model 
gc.collect()

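Build model and convert it to OpenVINO IR format#

Video-LLaVA is an autoregressive transformer generative model, which means that each model step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by the previously generated tokens, until a stop condition is reached (the generated sequence hits the maximum length or an end-of-string token is obtained). How the next token is selected from the predicted probabilities depends on the chosen decoding method. You can find more information about the most popular decoding methods in this blog. The entry point for the generation process for models from the Hugging Face Transformers library is the generate method. You can find more information about its parameters and configuration in the documentation. To preserve flexibility in the choice of decoding method, we convert only the model inference for a single step.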
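
The loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_next_token_logits` is a hypothetical stand-in for one model inference step (it is not part of Video-LLaVA or Transformers), the vocabulary has five tokens, and greedy selection plays the role of the decoding method.

```python
from typing import List

EOS_TOKEN_ID = 0  # hypothetical end-of-string token id for this toy example
VOCAB_SIZE = 5


def toy_next_token_logits(token_ids: List[int]) -> List[float]:
    """Hypothetical one-step 'model': scores every vocabulary entry.

    It favours (last token + 1) modulo VOCAB_SIZE, so the sequence counts
    upward and eventually wraps around to EOS_TOKEN_ID.
    """
    favoured = (token_ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if idx == favoured else 0.0 for idx in range(VOCAB_SIZE)]


def greedy_generate(prompt_ids: List[int], max_new_tokens: int) -> List[int]:
    """Predict one token per step until EOS or the length limit is reached."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(token_ids)  # one inference step
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)  # greedy decoding
        token_ids.append(next_id)
        if next_id == EOS_TOKEN_ID:  # stop condition: end-of-string token
            break
    return token_ids


generated = greedy_generate([2], max_new_tokens=10)  # stops at the EOS token
truncated = greedy_generate([2], max_new_tokens=1)   # stops at the length limit
```

Swapping the `max(...)` line for a sampling rule changes the decoding method without touching the single-step model call, which is the flexibility that converting only one inference step is meant to preserve.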

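The inference flow differs between the first step and the following steps. On the first step, the model accepts the preprocessed input instruction and video, and then the LLM-based part of the model runs on the input embeddings to predict the probabilities of the next generated token. On each following step, the model accepts only the id of the next token, selected according to the sampling strategy, together with the cached attention keys and values. Because the output side is autoregressive, the hidden state of an output token stays the same once computed for every further generation step, so recomputing it each time a new token is generated is wasteful. With the cache, the model saves the hidden state once it has been computed. At each time step the model computes the state only for the most recently generated output token and reuses the saved states for the earlier tokens. This reduces the generation complexity of a transformer model from $O(n^{3})$ to $O(n^{2})$. More details about how it works can be found in this article.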
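
A rough way to see where the saving comes from is to count query-key interactions during generation. The sketch below is a toy cost model, not a measurement of Video-LLaVA: the function names are hypothetical, only attention score computations are counted, and generation is assumed to start from an empty sequence.

```python
def attention_pairs_without_cache(num_tokens: int) -> int:
    """Without a cache, every step re-processes the whole sequence.

    At a step with sequence length t, each of the t tokens attends to itself
    and to all earlier tokens (causal attention): t * (t + 1) / 2 pairs.
    """
    total = 0
    for seq_len in range(1, num_tokens + 1):
        total += seq_len * (seq_len + 1) // 2
    return total


def attention_pairs_with_cache(num_tokens: int) -> int:
    """With cached keys and values, only the newest token is processed.

    At a step with sequence length t, the single new token attends to the
    t positions available so far: t pairs.
    """
    total = 0
    for seq_len in range(1, num_tokens + 1):
        total += seq_len
    return total


for n in (10, 100, 1000):
    uncached = attention_pairs_without_cache(n)
    cached = attention_pairs_with_cache(n)
    print(f"n={n}: without cache {uncached}, with cache {cached}")
```

The uncached total grows with the cube of the number of generated tokens, while the cached total grows with the square, which matches the complexity reduction stated above.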

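Prepare helpers for model conversion#

The code below prepares a function for converting the Video-LLaVA model to the OpenVINO Intermediate Representation format. It splits the model into the parts described above, prepares example inputs for each part, and converts each part using the OpenVINO Model Conversion API. The ov.convert_model function accepts a PyTorch model instance and returns an ov.Model object that represents the model in OpenVINO format. The result is ready to be loaded on a device with ov.compile_model and can also be saved to disk with ov.save_model.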

import torch 
import openvino as ov 
import nncf 
from typing import Optional, Tuple, List 

class ModelWrapper(torch.nn.Module): 
    def __init__(self, model): 
        super().__init__() 
        self.model = model 

    def forward( 
        self, 
        input_ids: torch.LongTensor = None, 
        attention_mask: Optional[torch.Tensor] = None, 
        past_key_values: Optional[List[torch.FloatTensor]] = None, 
        inputs_embeds: Optional[torch.FloatTensor] = None, 
        labels: Optional[torch.LongTensor] = None, 
    ): 
        outputs = self.model.model( 
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            past_key_values=past_key_values, 
            inputs_embeds=inputs_embeds, 
            use_cache=True, 
            output_attentions=False, 
            output_hidden_states=False, 
            return_dict=True, 
        ) 

        hidden_states = outputs[0] 
        logits = self.model.lm_head(hidden_states) 

        return (logits, outputs.past_key_values) 

def set_node_names(ov_model, input_names=None, output_names=None): 
    if input_names is not None: 
        for inp, name in zip(ov_model.inputs, input_names): 
            inp.get_tensor().set_names({name}) 
    if output_names is not None: 
        for out, name in zip(ov_model.outputs, output_names): 
            out.get_tensor().set_names({name}) 

    ov_model.validate_nodes_and_infer_types() 

def cleanup_torchscript_cache(): 
    """ 
    Helper for removing cached model representation 
    """ 
    torch._C._jit_clear_class_registry() 
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore() 
    torch.jit._state._clear_class_state() 

def convert_videollava( 
    pt_model: torch.nn.Module, 
    model_path: Path,
    videollava_wc_parameters: Optional[dict] = None,
): 
    """ 
    Video-LLaVA model conversion function 

    Params: 
        pt_model: PyTorch model 
        model_path: path for saving model 
    Returns:
        None 
    """ 
    ov_out_path = Path(model_path) 
    pt_model.config.save_pretrained(ov_out_path) 
    pt_model.config.use_cache = True 
    pt_model.config.torchscript = True 
    wrapped = ModelWrapper(pt_model) 
    first_stage_model_path = ov_out_path / "videollava_input_embed.xml" 
    second_stage_model_path = ov_out_path / "videollava_with_past.xml" 

    if first_stage_model_path.exists() and second_stage_model_path.exists(): 
        print("Video-LLaVA model successfully converted") 
        del pt_model 
        return 
    example_input_first_stage = { 
        "inputs_embeds": torch.zeros((1, 307, 4096)), 
        "attention_mask": torch.ones((1, 307), dtype=torch.long), 
    } 
    outs = wrapped(**example_input_first_stage) 
    input_names = ["input_ids", "attention_mask"] 
    output_names = ["logits"] 
    for idx in range(len(outs[1])): 
        input_names.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"]) 
        output_names.extend([f"present.{idx}.key", f"present.{idx}.value"]) 

    if not first_stage_model_path.exists(): 
        ov_model = ov.convert_model(wrapped, example_input=example_input_first_stage) 
        set_node_names(ov_model, output_names=output_names) 
        if videollava_wc_parameters is not None: 
            print("Applying weight compression to first stage Video-LLaVA model") 
            ov_model = nncf.compress_weights(ov_model, **videollava_wc_parameters) 
        ov.save_model(ov_model, first_stage_model_path) 
        cleanup_torchscript_cache() 
        del ov_model 
        gc.collect() 

    if not second_stage_model_path.exists(): 
        example_input_second_stage = { 
            "input_ids": torch.ones((1, 1), dtype=torch.long), 
            "attention_mask": torch.ones((1, outs[1][-1][-1].shape[-2] + 1), dtype=torch.long), 
            "past_key_values": outs[1], 
        } 
        ov_model = ov.convert_model(wrapped, example_input=example_input_second_stage) 
        set_node_names(ov_model, input_names, output_names) 

        if videollava_wc_parameters is not None: 
            print("Applying weight compression to second stage Video-LLaVA model") 
            ov_model = nncf.compress_weights(ov_model, **videollava_wc_parameters) 
        ov.save_model(ov_model, second_stage_model_path) 
        cleanup_torchscript_cache() 
        del ov_model 
        gc.collect() 
    print("Video-LLaVA model successfully converted") 
    del wrapped 
    del pt_model

INFO:nncf:NNCF initialized successfully.Supported frameworks detected : torch, openvino

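Convert and optimize model#

Model conversion and optimization consist of the following steps:

Download the original PyTorch model.
Compress the model weights using NNCF.
Convert the model to OpenVINO format and save it to disk.

Let's consider each step in more detail.

Instantiate PyTorch model#

To create a PyTorch model, use the from_pretrained method of the LlavaLlamaForCausalLM model class. The model weights are downloaded from the HuggingFace hub during the first run. This may take some time and requires at least 13 GB of free disk space.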

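Compress model weights to 4 and 8 bits using NNCF#

To reduce memory consumption, weight compression can be applied using NNCF. Weight compression aims to reduce the memory footprint of a model. It can also lead to a significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models that need a large amount of memory to store the weights during inference can benefit from weight compression in the following ways:

It enables the inference of exceptionally large models that cannot fit in the memory of the device.
It improves the inference performance of the models by reducing the memory access latency when computing operations with weights, for example, Linear layers.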
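
As a back-of-the-envelope illustration of the memory argument, the sketch below estimates the weight storage of a model at different precisions. The parameter count of 7 billion is an assumed round number for a "7B" model, and the estimate ignores quantization scales, zero points, and activations, so real file sizes will differ.

```python
def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16_gb = weights_size_gb(NUM_PARAMS, 16)
int8_gb = weights_size_gb(NUM_PARAMS, 8)
int4_gb = weights_size_gb(NUM_PARAMS, 4)

print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, INT4: {int4_gb:.1f} GB")
```

Halving the number of bits per weight halves both the storage and the amount of data that has to be read from memory for every generated token, which is why memory-bound models benefit the most.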

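Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that with weight compression the activations remain floating-point, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement that is on par with the performance of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, which makes it easy to use.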
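
To make the idea concrete, here is a minimal sketch of asymmetric weight quantization for a single group of weights, written in plain Python. It is not NNCF's implementation: the function names and the sample weights are hypothetical, and real weight compression works on whole tensors with per-group scales. The sketch only shows why fewer bits mean a coarser approximation of the original weights.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], num_bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric quantization: map [min, max] of the group onto 0 .. 2**num_bits - 1."""
    levels = 2**num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for a constant group
    quantized = [round((w - w_min) / scale) for w in weights]
    return quantized, scale, w_min


def dequantize_group(quantized: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate floating-point weights from the integer codes."""
    return [q * scale + w_min for q in quantized]


def max_abs_error(weights: List[float], num_bits: int) -> float:
    """Largest difference between a weight and its quantize/dequantize round trip."""
    quantized, scale, w_min = quantize_group(weights, num_bits)
    restored = dequantize_group(quantized, scale, w_min)
    return max(abs(w - r) for w, r in zip(weights, restored))


group = [-0.42, -0.10, 0.03, 0.27, 0.51, 0.08, -0.33, 0.19]  # made-up weights
error_int4 = max_abs_error(group, num_bits=4)
error_int8 = max_abs_error(group, num_bits=8)
print(f"max error INT4: {error_int4:.5f}, INT8: {error_int8:.5f}")
```

Only the weights are replaced by integer codes plus a scale and an offset; the inputs that multiply them stay in floating point, which is the difference from full model quantization described above.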

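The nncf.compress_weights function can be used to perform weight compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.

More details about weight compression can be found in the OpenVINO documentation.

Note: There is no speedup for INT4 compressed models on dGPU.

Convert model to OpenVINO IR format#

Convert the model to OpenVINO format using the conversion helper function defined above.

Select below whether to run INT4 weight compression instead of INT8 weight compression.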

import ipywidgets as widgets 

compression_mode = widgets.Dropdown( 
    options=["INT4", "INT8"], 
    value="INT4", 
    description="Compression mode:", 
    disabled=False, 
) 

compression_mode

Dropdown(description='Compression mode:', options=('INT4', 'INT8'), value='INT4')

if compression_mode.value == "INT4": 
    compressed_model_dir = Path("videollava/INT4_compressed_weights") 
    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT4_ASYM, group_size=128, ratio=0.8) 
else: 
    compressed_model_dir = Path("videollava/INT8_compressed_weights") 
    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT8) 

if not compressed_model_dir.exists(): 
    compressed_model_dir.mkdir(exist_ok=True, parents=True) 
    model = LlavaLlamaForCausalLM.from_pretrained(model_id) 
    model.resize_token_embeddings(len(tokenizer)) 

    if hasattr(config, "max_sequence_length"): 
        context_len = config.max_sequence_length 
    else: 
        context_len = 2048 
    image_tower = model.get_image_tower() 
    if not image_tower.is_loaded: 
        image_tower.load_model() 
    video_tower = model.get_video_tower() 
    if not video_tower.is_loaded: 
        video_tower.load_model() 

    model.eval() 
    with torch.no_grad(): 
        convert_videollava( 
            model, 
            compressed_model_dir, 
            videollava_wc_parameters=videollava_wc_parameters, 
        ) 
    del model 
    gc.collect();

WARNING:nncf:NNCF provides best results with torch==2.1.0, while current torch version is 2.1.2+cu121.  If you encounter issues, consider switching to torch==2.1.0 
Applying weight compression to first stage Video-LLaVA model

INFO:nncf:Statistics of the bitwidth distribution: 
+--------------+-----------------+--------------------+ 
| Num bits (N) | % all weight    | % internal weights | 
+==============+=================+====================+ 
| 8            | 22% (58 / 225)  | 20% (56 / 223)     | 
+--------------+-----------------+--------------------+ 
| 4            | 78% (167 / 225) | 80% (167 / 223)    | 
+--------------+-----------------+--------------------+

Applying weight compression to second stage Video-LLaVA model

INFO:nncf:Statistics of the bitwidth distribution: 
+--------------+-----------------+--------------------+ 
| Num bits (N) | % all weight    | % internal weights | 
+==============+=================+====================+ 
| 8            | 23% (58 / 226)  | 20% (56 / 224)     | 
+--------------+-----------------+--------------------+ 
| 4            | 77% (168 / 226) | 80% (168 / 224)    | 
+--------------+-----------------+--------------------+

Video-LLaVA model successfully converted

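Prepare OpenVINO based inference pipeline#

The OVLlavaLlamaForCausalLM class provides an easy-to-use interface for using the model in a generation scenario. It is based on transformers.generation.GenerationMixin, which makes it possible to reuse all the rich generation capabilities implemented in the HuggingFace Transformers library. More details about this interface can be found in the HuggingFace documentation.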

from transformers.generation import GenerationConfig, GenerationMixin 
from transformers.modeling_outputs import CausalLMOutputWithPast 
import numpy as np 
import torch 

class OVLlavaLlamaForCausalLM(GenerationMixin): 
    def __init__(self, core, model_dir, device): 
        self.model = core.read_model(model_dir / "videollava_with_past.xml") 
        self.model_input_embed = core.compile_model(model_dir / "videollava_input_embed.xml", device) 
        self.input_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.inputs)} 
        self.output_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.outputs)} 
        self.key_value_input_names = [key for key in self.input_names if "key_values" in key]
        self.key_value_output_names = [key for key in self.output_names if "present" in key]
        compiled_model = core.compile_model(self.model, device) 
        self.request = compiled_model.create_infer_request() 
        self.config = transformers.AutoConfig.from_pretrained(model_dir) 
        self.generation_config = GenerationConfig.from_model_config(self.config)
        self.main_input_name = "input_ids" 
        self.device = torch.device("cpu") 
        self.num_pkv = 2 
        self._supports_cache_class = False 

    def can_generate(self): 
        """Returns True to validate the check that the model using `GenerationMixin.generate()` can indeed generate.
        """ 
        return True 

    def __call__( 
        self, 
        input_ids: torch.LongTensor, 
        images: torch.Tensor, 
        attention_mask: Optional[torch.LongTensor] = None, 
        prefix_mask: Optional[torch.LongTensor] = None, 
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, 
        **kwargs, 
    ) -> CausalLMOutputWithPast: 
        return self.forward(input_ids, images, attention_mask, prefix_mask, past_key_values) 

    def forward( 
        self, 
        input_ids: torch.LongTensor, 
        images: torch.Tensor, 
        attention_mask: Optional[torch.LongTensor] = None, 
        prefix_mask: Optional[torch.LongTensor] = None, 
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, 
        **kwargs, 
    ) -> CausalLMOutputWithPast: 
        """General inference method""" 
        inputs = {} 
        if past_key_values is not None:
            # The attention mask covers all cached positions plus the new token
            attention_mask = torch.ones(
                (input_ids.shape[0], past_key_values[-1][-1].shape[-2] + 1),
                dtype=input_ids.dtype,
            )
            # Flatten the past_key_values
            past_key_values = (past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer)
            # Add the past_key_values to the decoder inputs
            inputs = dict(zip(self.key_value_input_names, past_key_values))

        else:
            return self.forward_with_image(input_ids, images, attention_mask)
        inputs["input_ids"] = np.array(input_ids)

        if "attention_mask" in self.input_names: 
            inputs["attention_mask"] = np.array(attention_mask) 

        # Run inference
        self.request.start_async(inputs, share_inputs=True)
        self.request.wait()

        logits = torch.from_numpy(self.request.get_tensor("logits").data)

        # Flat tuple of length: number of layers * number of past_key_value tensors per decoder layer (2 for the self-attention key and value)
        past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names)
        # Regroup into a tuple of length `n_layers`, where each inner tuple holds 2 tensors (self-attention key and value)
        past_key_values = tuple(past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv))
        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values) 

    def forward_with_image(self, input_ids, images, attention_mask): 
        """First step inference method, that resolves multimodal data""" 
        _, _, attention_mask, _, input_embeds, _ = preprocess_fn( 
            input_ids=input_ids, 
            position_ids=None, 
            attention_mask=attention_mask, 
            past_key_values=None, 
            labels=None, 
            images=images, 
        ) 
        outs = self.model_input_embed({"inputs_embeds": input_embeds, "attention_mask": attention_mask}) 
        logits = outs[0] 
        pkv = list(outs.values())[1:]
        pkv = tuple(pkv[i : i + self.num_pkv] for i in range(0, len(pkv), self.num_pkv))
        return CausalLMOutputWithPast(logits=torch.from_numpy(logits), past_key_values=pkv) 

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): 
        """ 
        This function is used during running GenerationMixin.generate for preparing model specific inputs for 
        each generation step 
        """ 
        past_len = 0 
        if past_key_values is not None: 
            input_ids = input_ids[:, -1].unsqueeze(-1) 
            past_len = past_key_values[-1][-1].shape[-2] 
        attention_mask = kwargs.get( 
            "attention_mask", 
            torch.ones(input_ids.shape[0], input_ids.shape[1] + past_len), 
        ) 
        if not kwargs.get("use_cache", True): 
            raise NotImplementedError("This pipeline relies on cached key/value tensors and does not support use_cache=False.")
        else: 
            prefix_mask = None 
        return { 
            "input_ids": input_ids, 
            "attention_mask": attention_mask, 
            "prefix_mask": prefix_mask, 
            "past_key_values": past_key_values, 
            "images": kwargs.get("images", None), 
        } 

    def _reorder_cache(self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]: 
        """ 
        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or 
        [`~PreTrainedModel.beam_sample`] is called. 
        This is required to match `past_key_values` with the correct beam_idx at every generation step.
        """
 
        # Adapted from transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel._reorder_cache
        return tuple(tuple(np.take(past_state, beam_idx, 0) for past_state in layer_past) for layer_past in past_key_values)
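
The forward method above moves between two layouts of the attention cache: a nested tuple with one (key, value) pair per layer, and a flat sequence that is matched to named model inputs and outputs. The round trip can be sketched in isolation as follows. The helper names are hypothetical and strings stand in for tensors; only the naming pattern `past_key_values.{idx}.key` / `past_key_values.{idx}.value` is taken from the conversion code.

```python
from typing import Dict, List, Sequence, Tuple

NUM_PKV = 2  # one key and one value tensor per decoder layer


def flatten_cache(past_key_values: Sequence[Sequence[str]]) -> List[str]:
    """Nested ((key, value), ...) per layer -> flat [key0, value0, key1, value1, ...]."""
    return [tensor for layer in past_key_values for tensor in layer]


def regroup_cache(flat: Sequence[str]) -> Tuple[Tuple[str, ...], ...]:
    """Flat sequence of model outputs -> one (key, value) tuple per layer."""
    return tuple(tuple(flat[i : i + NUM_PKV]) for i in range(0, len(flat), NUM_PKV))


def cache_input_names(num_layers: int) -> List[str]:
    """Input names in the same order as the flattened cache."""
    names: List[str] = []
    for idx in range(num_layers):
        names.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
    return names


nested = (("k0", "v0"), ("k1", "v1"))  # placeholders standing in for tensors
flat = flatten_cache(nested)
inputs: Dict[str, str] = dict(zip(cache_input_names(len(nested)), flat))
restored = regroup_cache(flat)
```

Because both the names and the flattened tensors follow the same layer-by-layer, key-then-value order, a plain `zip` is enough to pair them.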

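Run model inference#

With the model and the generation pipeline defined, we can run model inference.

Select inference device#

Select the device for running inference with OpenVINO from the dropdown list.

Note: There is no speedup for INT4 compressed models on dGPU.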

import ipywidgets as widgets 

core = ov.Core() 
device = widgets.Dropdown( 
    options=core.available_devices + ["AUTO"], 
    value="AUTO", 
    description="Device:", 
    disabled=False, 
) 

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Load OpenVINO model#

ov_model = OVLlavaLlamaForCausalLM(core, compressed_model_dir, device.value)

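Prepare input data#

To prepare the input data, we use the tokenizer and the image processor defined at the beginning of the tutorial. For alignment with the original PyTorch implementation, PyTorch tensors are used as input.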

from IPython.display import display, Video, Image

examples_dir = Path("Video-LLaVA/videollava/serve/examples") 
video_file = examples_dir / "sample_demo_22.mp4" 
image_file = examples_dir / "sample_img_22.png" 

video_tensor = video_processor.preprocess(str(video_file), return_tensors="pt")["pixel_values"][0] 
image_tensor = image_processor.preprocess(str(image_file), return_tensors="pt")["pixel_values"][0] 
images_tensor = [video_tensor, image_tensor] 

text_message = "Are the instruments in the pictures used in the video?" 
print(f"Question: {text_message}") 
display(Video(video_file, embed=True)) 
Image(image_file, embed=True)

Question: Are the instruments in the pictures used in the video?

../_images/videollava-multimodal-chatbot-with-output_19_2.png

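Test model inference#

Generating a long response can take a while. To access partial results as soon as they are generated, without waiting for the whole process to finish, you can use the Streaming API. Token streaming is a mode in which the generation system returns tokens one by one as the model produces them. This lets you show the user a progressively growing answer instead of making them wait for the whole generation. Streaming is an essential part of the end-user experience because it reduces latency, one of the most critical aspects of a smooth experience. You can find more details about how streaming works in the HuggingFace documentation.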
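
The streaming idea can be illustrated without any model at all. In the sketch below, `generate_tokens` is a hypothetical stand-in for a model that produces one token per step, and `stream_to_console` plays the role of a streamer that shows each token as soon as it arrives. Neither function belongs to Transformers or Video-LLaVA.

```python
from typing import Iterator, List


def generate_tokens(words: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a model that produces one token per step."""
    for word in words:
        yield word  # hand each token to the caller as soon as it exists


def stream_to_console(token_iter: Iterator[str]) -> str:
    """Print tokens as they arrive and return the full text at the end."""
    pieces = []
    for token in token_iter:
        print(token, end=" ", flush=True)  # partial result is visible immediately
        pieces.append(token)
    print()
    return " ".join(pieces)


answer = stream_to_console(generate_tokens(["Yes,", "the", "drums", "appear", "in", "the", "video."]))
```

The total generation time is unchanged; what improves is the time until the user sees the first words of the answer.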

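In addition, to simplify input preparation in conversation mode, we use the conversation template helper provided by the model authors to accumulate the history of provided messages and images.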

from videollava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria 
from videollava.constants import IMAGE_TOKEN_INDEX 
from transformers import TextStreamer 
from videollava.conversation import conv_templates, SeparatorStyle 

# Prepare 
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) 
conv_mode = "llava_v1" 

conv = conv_templates[conv_mode].copy() 
roles = ("user", "assistant") 

if mm_use_im_start_end: 
    inp = DEFAULT_VID_START_TOKEN + DEFAULT_IMAGE_TOKEN * 8 + DEFAULT_VID_END_TOKEN + "\n" + text_message
else: 
    inp = DEFAULT_IMAGE_TOKEN * 8 + "\n" + text_message 
conv.append_message(conv.roles[0], inp) 
conv.append_message(conv.roles[1], None) 

prompt = conv.get_prompt() 
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0) 

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 
keywords = [stop_str] 
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) 
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) 
print("Answer:") 

output_ids = ov_model.generate( 
    input_ids, 
    images=images_tensor, 
    do_sample=True, 
    temperature=0.2, 
    max_new_tokens=1024, 
    streamer=streamer, 
    use_cache=True, 
    stopping_criteria=[stopping_criteria], 
)

Answer: 
['video', 'image'] 
Yes, the instruments in the pictures are used in the video. The man is playing a drum set, which includes a bass drum, snare drum, and cymbals. The cymbals are used to produce different sounds, such as crashes and hi-hats. The man is also seen playing a guitar, which is another instrument used in the video.

Interactive demo#

import torch 
import gradio as gr 

from videollava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX 
from videollava.conversation import conv_templates, SeparatorStyle 

def generate(image, video, textbox_in): 
    if video is not None: 
        textbox_in = DEFAULT_IMAGE_TOKEN * 8 + "\n" + textbox_in 
        if image is not None: 
            textbox_in += "\n" + DEFAULT_IMAGE_TOKEN 
    elif image is not None: 
        textbox_in = DEFAULT_IMAGE_TOKEN + "\n" + textbox_in 

    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], textbox_in) 
    conv.append_message(conv.roles[1], None) 
    prompt = conv.get_prompt() 
    images_tensor = [] 
    if image is not None: 
        images_tensor.append(image_processor(image, return_tensors="pt")["pixel_values"][0]) 
    if video is not None: 
        images_tensor.append(video_processor(video, return_tensors="pt")["pixel_values"][0]) 
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0) 

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 
    keywords = [stop_str] 
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) 

    generate_kwargs = dict( 
        input_ids=input_ids, 
        images=images_tensor, 
        max_new_tokens=1024, 
        temperature=0.2, 
        do_sample=True, 
        use_cache=True, 
        stopping_criteria=[stopping_criteria], 
    ) 

    output_ids = ov_model.generate(**generate_kwargs) 

    input_token_len = input_ids.shape[1] 
    outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0] 
    outputs = outputs.strip() 
    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()

    return outputs

demo = gr.Interface( 
    generate, 
    [ 
        gr.Image(label="Input Image", type="filepath"), 
        gr.Video(label="Input Video"), 
        gr.Textbox(label="Question"), 
    ], 
    gr.Textbox(lines=10), 
    examples=[ 
        [ 
            f"{examples_dir}/extreme_ironing.jpg", 
            None, 
            "What is unusual about this image?", 
        ], 
        [ 
            f"{examples_dir}/waterview.jpg", 
            None, 
            "What are the things I should be cautious about when I visit here?", 
        ], 
        [ 
            f"{examples_dir}/desert.jpg", 
            None,
            "If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?",
        ], 
        [ 
            None, 
            f"{examples_dir}/sample_demo_1.mp4", 
            "Why is this video funny?", 
        ], 
        [ 
            None, 
            f"{examples_dir}/sample_demo_3.mp4", 
            "Can you identify any safety hazards in this video?", 
        ], 
        [ 
            None, 
            f"{examples_dir}/sample_demo_9.mp4", 
            "Describe the video.", 
        ], 
        [ 
            None, 
            f"{examples_dir}/sample_demo_22.mp4", 
            "Describe the activity in the video.", 
        ], 
        [ 
            f"{examples_dir}/sample_img_22.png", 
            f"{examples_dir}/sample_demo_22.mp4", 
            "Are the instruments in the pictures used in the video?", 
        ], 
        [ 
            f"{examples_dir}/sample_img_13.png", 
            f"{examples_dir}/sample_demo_13.mp4", 
            "Does the flag in the image appear in the video?", 
        ], 
        [ 
            f"{examples_dir}/sample_img_8.png", 
            f"{examples_dir}/sample_demo_8.mp4", 
            "Are the image and the video depicting the same place?", 
        ], 
    ], 
    title="Video-LLaVA🚀", 
    allow_flagging="never", 
) 
try: 
    demo.queue().launch(debug=False) 
except Exception: 
    demo.queue().launch(share=True, debug=False) 
# If you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# See the documentation for details: https://gradio.app/docs/