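Image generation with Stable Diffusion XL and OpenVINO

This Jupyter notebook can be launched only after a local installation.

Stable Diffusion XL (SDXL) is the latest image generation model in the Stable Diffusion family. Compared with earlier models such as Stable Diffusion 2.1, it is tuned for more photorealistic output, with more detailed imagery and composition.

With Stable Diffusion XL you can create more realistic images thanks to improved face generation, produce legible text within images, and create more aesthetically pleasing art from shorter prompts.

Pipeline

SDXL consists of an ensemble-of-experts pipeline for latent diffusion. In the first step, the base model generates (noisy) latents, which are then processed further by a refinement model specialized for the final denoising steps. The base model can be used as a standalone module or in a two-stage pipeline, as follows:

First, the base model generates latents of the desired output size.
Second, a specialized high-resolution model applies a technique called SDEdit (also known as "image-to-image") to the latents generated in the first step, using the same prompt.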

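Compared with previous versions of Stable Diffusion, SDXL uses a UNet backbone that is three times larger. The increase in model parameters comes mainly from additional attention blocks and a larger cross-attention context, because SDXL uses a second text encoder. The authors designed several novel conditioning schemes, trained SDXL on multiple aspect ratios, and introduced a refinement model that improves the visual fidelity of SDXL samples through a post-hoc image-to-image technique. In testing, SDXL shows drastically improved performance over previous versions of Stable Diffusion and achieves results competitive with black-box state-of-the-art image generators.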
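
To make the "larger cross-attention context" concrete: the SDXL paper reports that the per-token outputs of the two text encoders are concatenated along the channel axis before they are fed to the UNet. The sketch below only performs that arithmetic. The channel widths are quoted from the SDXL paper rather than read from the model files used in this notebook, and the helper name is hypothetical.

```python
# Illustrative arithmetic only: the channel widths below are quoted from the
# SDXL paper, not read from the model files used in this notebook.
TEXT_ENCODER_WIDTHS = {
    "CLIP ViT-L": 768,          # first text encoder
    "OpenCLIP ViT-bigG": 1280,  # second text encoder
}


def cross_attention_width(widths):
    """Width of the cross-attention context when the per-token outputs of
    all text encoders are concatenated along the channel axis."""
    return sum(widths.values())


sdxl_context = cross_attention_width(TEXT_ENCODER_WIDTHS)
print(f"SDXL cross-attention context width: {sdxl_context}")
```

A wider context means larger key and value projections in every cross-attention layer, which is one of the sources of the parameter growth described above.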

This tutorial explains how to run the SDXL model with OpenVINO.

We use pre-trained models from the Hugging Face Diffusers library. To simplify the user experience, the Hugging Face Optimum Intel library is used to convert the models to the OpenVINO™ IR format.

The tutorial consists of the following steps:

Install prerequisites.
Download the Stable Diffusion XL Base model from a public source using the OpenVINO integration with Hugging Face Optimum.
Run the Text2Image generation pipeline using Stable Diffusion XL base.
Run the Image2Image generation pipeline using Stable Diffusion XL base.
Download and convert the Stable Diffusion XL Refiner model from a public source using the OpenVINO integration with Hugging Face Optimum.
Run the two-stage Stable Diffusion XL pipeline.

Note: Some of the demonstrated models can require at least 64 GB of RAM for conversion and running.
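
Before starting a long conversion it can be useful to check how much memory the machine has. The following is a minimal sketch using only the standard library; it assumes a Unix-like system where os.sysconf is available and returns None elsewhere. The helper name total_ram_gib is hypothetical, and the 64 GB threshold is simply the figure from the note above.

```python
import os


def total_ram_gib():
    """Return total physical memory in GiB, or None if it cannot be determined.

    Relies on os.sysconf, which exists on Linux and most Unix-like systems.
    """
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # for example on Windows, where os.sysconf does not exist
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count / 2**30


REQUIRED_GIB = 64  # figure taken from the note above

ram = total_ram_gib()
if ram is None:
    print("Could not determine the installed RAM on this platform.")
elif ram < REQUIRED_GIB:
    # The reported total is usually a little below the nominal size,
    # so treat this as a rough guide rather than a hard limit.
    print(f"{ram:.1f} GiB RAM detected; some models may fail to convert.")
else:
    print(f"{ram:.1f} GiB RAM detected.")
```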

Table of contents

Install prerequisites
SDXL Base model
SDXL Refiner model
- Select inference device
- Run Text2Image generation with Refinement

Install prerequisites

%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu "diffusers>=0.18.0" "invisible-watermark>=0.2.0" "transformers>=4.33.0" "accelerate" "onnx"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"
%pip install -q "openvino>=2023.1.0" gradio
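
After installation, the minimum versions requested above can be verified from Python. This is a sketch under simple assumptions: the version comparison is deliberately naive (numeric dotted parts only, so pre-release suffixes are ignored), and the helper names version_tuple and find_problems are hypothetical.

```python
from importlib import metadata

# Minimum versions copied from the pip commands above.
MINIMUMS = {
    "diffusers": "0.18.0",
    "invisible-watermark": "0.2.0",
    "transformers": "4.33.0",
    "openvino": "2023.1.0",
}


def version_tuple(version):
    """Turn '4.33.0' into (4, 33, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def find_problems(minimums):
    """Return {package: reason} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


for package, reason in find_problems(MINIMUMS).items():
    print(f"{package}: {reason}")
```

An empty output means every listed package satisfies its minimum version.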

                                    

SDXL Base model

We start with the base model, which generates images of the desired output size. stable-diffusion-xl-base-1.0 can be downloaded from the Hugging Face Hub, where a ready-to-use model in OpenVINO format, compatible with Optimum Intel, is already provided.

To load an OpenVINO model and run inference with OpenVINO Runtime, replace the Diffusers StableDiffusionXLPipeline with Optimum's OVStableDiffusionXLPipeline. To load a PyTorch model and convert it to the OpenVINO format on the fly, set export=True.

You can save the model to disk with the save_pretrained method.

from pathlib import Path
from optimum.intel.openvino import OVStableDiffusionXLPipeline
import gc

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
model_dir = Path("openvino-sd-xl-base-1.0")

                                    

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino

                                    

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-09-19 18:52:15.570335: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-19 18:52:15.609718: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-19 18:52:16.242994: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(

Select inference device for the SDXL Base model

Select a device from the dropdown list to run inference with OpenVINO.

import ipywidgets as widgets
import openvino as ov

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)


device

                                        

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')
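
When the notebook code is run as a plain script, the dropdown widget is not available and the device has to be chosen programmatically. The sketch below shows one possible policy: prefer a GPU, fall back to the CPU, and otherwise let OpenVINO decide with "AUTO". The helper pick_device is hypothetical; it takes a list like the one returned by core.available_devices and assumes that multiple GPUs are reported with names such as "GPU.0" and "GPU.1".

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Pick the first preferred device family that is present, else "AUTO".

    `available` is a list of device names such as the one returned by
    core.available_devices.
    """
    for family in preferred:
        for name in available:
            # "GPU.0" and "GPU.1" both belong to the "GPU" family
            if name.split(".")[0] == family:
                return name
    return "AUTO"


print(pick_device(["CPU", "GPU"]))
print(pick_device(["CPU"]))
print(pick_device([]))
```

The returned string can be passed wherever the notebook uses device.value.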

                                        

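if not model_dir.exists():
    # First run: fetch the model without compiling it, convert the weights
    # to FP16, save the result locally, then compile for the selected device.
    text2image_pipe = OVStableDiffusionXLPipeline.from_pretrained(model_id, compile=False, device=device.value)
    text2image_pipe.half()
    text2image_pipe.save_pretrained(model_dir)
    text2image_pipe.compile()
else:
    # Later runs: load the saved model; it is compiled on load by default.
    text2image_pipe = OVStableDiffusionXLPipeline.from_pretrained(model_dir, device=device.value)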

                                        

Compiling the vae_decoder...
Compiling the unet...
Compiling the text_encoder...
Compiling the vae_encoder...
Compiling the text_encoder_2...
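
The cell above decides between downloading and reloading by checking only whether model_dir exists. If an earlier run was interrupted during save_pretrained, the directory can exist but be incomplete. The sketch below checks for each component named in the compilation log. It assumes that every component is stored as <component>/openvino_model.xml inside the saved directory; that layout is an assumption, so adjust MODEL_FILE if your export looks different. The helper name is hypothetical.

```python
from pathlib import Path

# Component names as they appear in the "Compiling the ..." log above.
BASE_COMPONENTS = ("unet", "vae_decoder", "vae_encoder", "text_encoder", "text_encoder_2")

# Assumption: each component is saved as <component>/openvino_model.xml.
MODEL_FILE = "openvino_model.xml"


def missing_components(model_dir, components=BASE_COMPONENTS):
    """List components whose model file is absent from a saved pipeline directory."""
    model_dir = Path(model_dir)
    return [name for name in components if not (model_dir / name / MODEL_FILE).is_file()]


missing = missing_components("openvino-sd-xl-base-1.0")
if missing:
    print("Export is absent or incomplete, missing:", ", ".join(missing))
else:
    print("All components found.")
```

If anything is reported missing, deleting the directory and rerunning the cell above is the simplest way to get a clean export.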

                                        

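Run the Text2Image generation pipeline

Now we can run the model to generate images from text prompts. To speed up evaluation and reduce memory use, we lower num_inference_steps and the image size (through height and width). You can change these values to suit your needs and your target hardware. To make the results reproducible, we also pass a generator parameter based on a NumPy random state with a fixed seed.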

import numpy as np

prompt = "cute cat 4k, high-res, masterpiece, best quality, soft lighting, dynamic angle"
image = text2image_pipe(prompt, num_inference_steps=15, height=512, width=512, generator=np.random.RandomState(314)).images[0]
image.save("cat.png")
image

                                        

/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:559: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)

0%|          | 0/15 [00:00<?, ?it/s]

/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:590: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)
/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:606: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)

../_images/248-stable-diffusion-xl-with-output_10_3.png
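
Two of the parameters used above can be illustrated without running the model: the image size, which determines how large the latent tensor is, and the seeded generator, which fixes the starting noise. The sketch below assumes the usual Stable Diffusion latent layout (4 channels, each spatial side divided by 8) and only mimics how seeded noise behaves; it is not the pipeline's own sampling code, and latent_shape is a hypothetical helper.

```python
import numpy as np

VAE_SCALE_FACTOR = 8  # the VAE downsamples each spatial side by 8
LATENT_CHANNELS = 4   # number of channels in the latent space


def latent_shape(height, width, batch_size=1):
    """Shape of the initial noise for a given output size (NCHW layout)."""
    if height % VAE_SCALE_FACTOR or width % VAE_SCALE_FACTOR:
        raise ValueError("height and width must be divisible by 8")
    return (batch_size, LATENT_CHANNELS, height // VAE_SCALE_FACTOR, width // VAE_SCALE_FACTOR)


shape = latent_shape(512, 512)

# Two generators created from the same seed yield identical noise...
noise_a = np.random.RandomState(314).randn(*shape)
noise_b = np.random.RandomState(314).randn(*shape)
# ...while a different seed gives a different starting point.
noise_c = np.random.RandomState(315).randn(*shape)

print(shape)
print(np.array_equal(noise_a, noise_b), np.array_equal(noise_a, noise_c))
```

Halving both height and width cuts the number of latent values to a quarter, which is why reducing the image size lowers memory use and run time.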

Text2image generation interactive demo

import gradio as gr

if text2image_pipe is None:
    text2image_pipe = OVStableDiffusionXLPipeline.from_pretrained(model_dir, device=device.value)

prompt = "cute cat 4k, high-res, masterpiece, best quality, soft lighting, dynamic angle"

def generate_from_text(text, seed, num_steps):
    result = text2image_pipe(text, num_inference_steps=num_steps, generator=np.random.RandomState(seed), height=512, width=512).images[0]
    return result


with gr.Blocks() as demo:
    with gr.Column():
        positive_input = gr.Textbox(label="Text prompt")
        with gr.Row():
            seed_input = gr.Number(precision=0, label="Seed", value=42, minimum=0)
            steps_input = gr.Slider(label="Steps", value=10)
            btn = gr.Button()
        out = gr.Image(label="Result", type="pil", width=512)
        btn.click(generate_from_text, [positive_input, seed_input, steps_input], out)
        gr.Examples([
            [prompt, 999, 20],
            ["underwater world coral reef, colorful jellyfish, 35mm, cinematic lighting, shallow depth of field,  ultra quality, masterpiece, realistic", 89, 20],
            ["a photo realistic happy white poodle dog playing in the grass, extremely detailed, high res, 8k, masterpiece, dynamic angle", 1569, 15],
            ["Astronaut on Mars watching sunset, best quality, cinematic effects,", 65245, 12],
            ["Black and white street photography of a rainy night in New York, reflections on wet pavement", 48199, 10]
        ], [positive_input, seed_input, steps_input])

# If you are launching remotely, specify server_name and server_port:
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
# To create a public link for sharing the demo, add share=True.
demo.launch()

                                        

Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().

demo.close()
text2image_pipe = None
gc.collect();

                                        

Closing server running on port: 7860

                                        

Run the Image2Image generation pipeline

The already converted model can be reused to run the Image2Image generation pipeline. To do so, replace OVStableDiffusionXLPipeline with OVStableDiffusionXLImg2ImgPipeline.

Select inference device for the Image2Image pipeline

Select a device from the dropdown list to run inference with OpenVINO.

device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

                                

from optimum.intel import OVStableDiffusionXLImg2ImgPipeline

image2image_pipe = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_dir, device=device.value)

Compiling the vae_decoder...
Compiling the unet...
Compiling the text_encoder_2...
Compiling the vae_encoder...
Compiling the text_encoder...

                                

photo_prompt = "professional photo of a cat, extremely detailed, hyper realistic, best quality, full hd"
photo_image = image2image_pipe(photo_prompt, image=image, num_inference_steps=25, generator=np.random.RandomState(356)).images[0]
photo_image.save("photo_cat.png")
photo_image

                                

/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:622: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)

0%|          | 0/7 [00:00<?, ?it/s]


../_images/248-stable-diffusion-xl-with-output_18_3.png
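
The progress bar above counts 7 steps although num_inference_steps=25 was requested. In Diffusers-style image-to-image pipelines a strength parameter controls how much noise is added to the input image, and only the corresponding final part of the schedule is denoised. The sketch below reproduces that arithmetic. It assumes the default strength of 0.3 of the Diffusers SDXL image-to-image pipeline, since the cell above does not set strength explicitly; effective_steps is a hypothetical helper.

```python
def effective_steps(num_inference_steps, strength=0.3):
    """Number of denoising steps an image-to-image run actually performs.

    Follows the rule used by Diffusers-style img2img pipelines: only the
    last `strength` fraction of the schedule is executed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for requested in (25, 15):
    print(requested, "->", effective_steps(requested))
```

The same rule explains why the refiner run later in this notebook shows 4 steps for num_inference_steps=15. A higher strength moves the result further away from the input image and runs more steps.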

import gradio as gr
from diffusers.utils import load_image
import numpy as np


load_image(
    "https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png"
).resize((512, 512)).save("castle_friedrich.png")


if image2image_pipe is None:
    image2image_pipe = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_dir, device=device.value)

def generate_from_image(text, image, seed, num_steps):
    result = image2image_pipe(text, image=image, num_inference_steps=num_steps, generator=np.random.RandomState(seed)).images[0]
    return result


with gr.Blocks() as demo:
    with gr.Column():
        positive_input = gr.Textbox(label="Text prompt")
        with gr.Row():
            seed_input = gr.Number(precision=0, label="Seed", value=42, minimum=0)
            steps_input = gr.Slider(label="Steps", value=10)
            btn = gr.Button()
        with gr.Row():
            i2i_input = gr.Image(label="Input image", type="pil")
            out = gr.Image(label="Result", type="pil", width=512)
        btn.click(generate_from_image, [positive_input, i2i_input, seed_input, steps_input], out)
        gr.Examples([
            ["amazing landscape from legends", "castle_friedrich.png", 971, 60],
            ["Masterpiece of watercolor painting in Van Gogh style", "cat.png", 37890, 40]
        ], [positive_input, i2i_input, seed_input, steps_input])

# If you are launching remotely, specify server_name and server_port:
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
# To create a public link for sharing the demo, add share=True.
demo.launch()

                                

Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().

demo.close()
del image2image_pipe
gc.collect()

                                

Closing server running on port: 7860

                                

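SDXL Refiner model

As described above, Stable Diffusion XL can be used in a two-stage approach: first, the base model generates latents of the desired output size; second, a specialized high-resolution model refines the latents from the first step, using the same prompt. The Stable Diffusion XL Refiner model is designed to enhance an image under the guidance of a user-specified text prompt, and it can be used to improve the quality of images generated by Stable Diffusion XL Base. The refiner accepts the latents produced by the SDXL base model together with a text prompt for improving the generated image.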

from optimum.intel import OVStableDiffusionXLImg2ImgPipeline, OVStableDiffusionXLPipeline
from pathlib import Path

refiner_model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
refiner_model_dir = Path("openvino-sd-xl-refiner-1.0")


if not refiner_model_dir.exists():
    refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(refiner_model_id, export=True, compile=False)
    refiner.half()
    refiner.save_pretrained(refiner_model_dir)
    del refiner
    gc.collect()

                                    

Select inference device

Select a device from the dropdown list to run inference with OpenVINO.

device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

                                        

Run Text2Image generation with Refinement

import numpy as np
import gc

model_dir = Path("openvino-sd-xl-base-1.0")
base = OVStableDiffusionXLPipeline.from_pretrained(model_dir, device=device.value)
prompt = "cute cat 4k, high-res, masterpiece, best quality, soft lighting, dynamic angle"
# output_type="latent" skips VAE decoding and returns the latents for the refiner
latents = base(prompt, num_inference_steps=15, height=512, width=512, generator=np.random.RandomState(314), output_type="latent").images[0]

# Free the base pipeline before loading the refiner to limit peak memory use
del base
gc.collect()

                                        

Compiling the vae_decoder...
Compiling the unet...
Compiling the text_encoder_2...
Compiling the text_encoder...
Compiling the vae_encoder...

0%|          | 0/15 [00:00<?, ?it/s]


refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(refiner_model_dir, device=device.value)

                                        

Compiling the vae_decoder...
Compiling the unet...
Compiling the text_encoder_2...
Compiling the vae_encoder...

                                        

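# Add a batch dimension and move channels last (NCHW -> NHWC) before passing the latents to the refiner
image = refiner(prompt=prompt, image=np.transpose(latents[None, :], (0, 2, 3, 1)), num_inference_steps=15, generator=np.random.RandomState(314)).images[0]
image.save("cat_refined.png")

image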

0%|          | 0/4 [00:00<?, ?it/s]


../_images/248-stable-diffusion-xl-with-output_29_2.png
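
The refiner call above reshapes the latents with latents[None, :] followed by np.transpose(..., (0, 2, 3, 1)). The sketch below shows what these two operations do to the array shape. It uses random stand-in data and assumes that the latents of one 512x512 image have the shape (4, 64, 64), that is, 4 channels and each spatial side divided by 8. No model is involved.

```python
import numpy as np

# Stand-in for the latents of one 512x512 image as returned by the base
# pipeline after `.images[0]`: channels first, no batch dimension.
latents = np.random.RandomState(0).randn(4, 64, 64).astype(np.float32)

batched = latents[None, :]                           # add batch axis: (1, 4, 64, 64), NCHW
channels_last = np.transpose(batched, (0, 2, 3, 1))  # reorder axes: NCHW -> NHWC

print(batched.shape)
print(channels_last.shape)

# The transpose only reorders axes; every value is preserved.
print(channels_last[0, 10, 20, 3] == batched[0, 3, 10, 20])
```

Applying np.transpose with the axes (0, 3, 1, 2) restores the original channels-first layout.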