High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™

This Jupyter notebook can be launched only after a local installation.

FreeVC converts the voice of a source speaker to a target style without text annotation and without changing the linguistic content.

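The figure below shows the FreeVC model architecture used for inference. This notebook focuses only on the inference part. The architecture has three main parts: the Prior Encoder, the Speaker Encoder, and the Decoder. The Prior Encoder contains the WavLM model, a bottleneck extractor, and a normalizing flow. For details, see this paper.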

Inference

Image source

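The FreeVC repository suggests using only the command-line interface, and only with CUDA. This notebook shows how to use FreeVC from Python without a CUDA device. It consists of the following steps:

  • Download and prepare the models.

  • Run inference.

  • Convert the models to OpenVINO Intermediate Representation (IR).

  • Run inference using only the OpenVINO IR models.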

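Prerequisites

These steps can be done manually. Otherwise, the notebook performs them automatically when it runs, but only to the minimum extent required.

  1. Clone this repository: git clone https://github.com/OlaWod/FreeVC.git
  2. Download WavLM-Large and put it under the FreeVC/wavlm/ directory.
  3. Optionally, download the VCTK dataset. This example downloads only two audio files from the Hugging Face FreeVC example.
  4. Download the pretrained models and put them under the ‘checkpoints’ directory (this example needs only freevc.pth).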
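
The manual steps above leave a predictable file layout behind. The following is a minimal sketch for checking it before running the rest of the notebook. The helper name `missing_prerequisites` is hypothetical, and the paths are assumptions taken from the steps above and from the file name `WavLM-Large.pt` used by the download cell later in this notebook.

```python
from pathlib import Path

# Paths follow the prerequisite steps; adjust them if your layout differs.
REQUIRED_PATHS = (
    "FreeVC",                       # cloned repository
    "FreeVC/wavlm/WavLM-Large.pt",  # WavLM-Large checkpoint
    "checkpoints/freevc.pth",       # pretrained FreeVC model
)


def missing_prerequisites(root="."):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


missing = missing_prerequisites()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

An empty result means the download cells below will find everything in place and skip their downloads.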

Install the additional requirements.

%pip install -q "openvino>=2023.3.0" "librosa>=0.8.1" "webrtcvad==2.0.10" gradio "torch>=2.1" torchvision --extra-index-url https://download.pytorch.org/whl/cpu
DEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.

Check whether FreeVC is installed and append its path to sys.path.

from pathlib import Path
import sys


free_vc_repo = 'FreeVC'
if not Path(free_vc_repo).exists():
    !git clone https://github.com/OlaWod/FreeVC.git

sys.path.append(free_vc_repo)
Cloning into 'FreeVC'...
sys.path.append("../utils")
from notebook_utils import download_file

wavlm_large_dir_path = Path('FreeVC/wavlm')
wavlm_large_path = wavlm_large_dir_path / 'WavLM-Large.pt'

if not wavlm_large_path.exists():
    download_file(
        'https://valle.blob.core.windows.net/share/wavlm/WavLM-Large.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D',
        directory=wavlm_large_dir_path
    )
FreeVC/wavlm/WavLM-Large.pt:   0%|          | 0.00/1.18G [00:00<?, ?B/s]
freevc_chpt_dir = Path('checkpoints')
freevc_chpt_name = 'freevc.pth'
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name

if not freevc_chpt_path.exists():
    download_file(
        f'https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}',
        directory=freevc_chpt_dir
    )
checkpoints/freevc.pth:   0%|          | 0.00/451M [00:00<?, ?B/s]
audio1_name = 'p225_001.wav'
audio1_url = f'https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}'
audio2_name = 'p226_002.wav'
audio2_url = f'https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}'

if not Path(audio1_name).exists():
    download_file(audio1_url)

if not Path(audio2_name).exists():
    download_file(audio2_url)
p225_001.wav:   0%|          | 0.00/50.8k [00:00<?, ?B/s]
p226_002.wav:   0%|          | 0.00/135k [00:00<?, ?B/s]

Imports and settings

import logging
import os
import time

import librosa
import numpy as np
import torch
from scipy.io.wavfile import write
from tqdm import tqdm

import openvino as ov

import utils
from models import SynthesizerTrn
from speaker_encoder.voice_encoder import SpeakerEncoder
from wavlm import WavLM, WavLMConfig

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

Redefine the get_cmodel function from utils so that it does not use CUDA.

def get_cmodel():
    # Load the WavLM checkpoint on the CPU so that no CUDA device is needed
    checkpoint = torch.load(wavlm_large_path, map_location='cpu')
    cfg = WavLMConfig(checkpoint['cfg'])
    cmodel = WavLM(cfg)
    cmodel.load_state_dict(checkpoint['model'])
    cmodel.eval()

    return cmodel

Initialize the models.

hps = utils.get_hparams_from_file('FreeVC/configs/freevc.json')
os.makedirs('outputs/freevc', exist_ok=True)

net_g = SynthesizerTrn(
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model
)

utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True)
cmodel = get_cmodel()
smodel = SpeakerEncoder('FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt', device='cpu')
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
                                        warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loaded the voice encoder model on cpu in 0.00 seconds.

Set up the source and target audio pairs used for conversion.

srcs = [audio1_name, audio2_name]
tgts = [audio2_name, audio1_name]

Run inference.

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)

        c = utils.get_content(cmodel, wav_src)

        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()

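        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        # Include the source and target stems so that files written within the same minute
        # do not overwrite each other
        result_name = "{}_{}_to_{}.wav".format(timestamp, Path(src).stem, Path(tgt).stem)
        write(os.path.join('outputs/freevc', result_name), hps.data.sampling_rate, tgt_audio)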
0it [00:00, ?it/s]
1it [00:00,  1.06it/s]
2it [00:01,  1.25it/s]
2it [00:01,  1.22it/s]

The resulting audio files are available in ‘outputs/freevc’.

Convert the models to OpenVINO Intermediate Representation

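Convert each model to OpenVINO IR with FP16 precision. The ov.convert_model function accepts the original PyTorch model object and an example input for tracing, and returns an instance of the OpenVINO Model class that represents the model. The resulting model is ready to use: it can be loaded onto a device with compile_model or saved to disk with the ov.save_model function. The read_model method loads a saved model from disk. For more information about model conversion, see this page.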

First, convert the WavLM model, which is part of the Prior Encoder, to OpenVINO IR format. The code keeps the model's original name: cmodel.

# define forward as extract_features for compatibility
cmodel.forward = cmodel.extract_features
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "cmodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml")

length = 32000

dummy_input = torch.randn(1, length)

Convert to OpenVINO IR format.

core = ov.Core()

class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input):
        return self.model(input)[0]

if not ir_cmodel_path.exists():
    ir_cmodel = ov.convert_model(ModelWrapper(cmodel), example_input=dummy_input)
    ov.save_model(ir_cmodel, ir_cmodel_path)
else:
    ir_cmodel = core.read_model(ir_cmodel_path)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert src_len, bsz == value.shape[:2]

Select the device for running inference with OpenVINO from the dropdown list.

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
compiled_cmodel = core.compile_model(ir_cmodel, device.value)
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")


length = 32000

dummy_input = torch.randn(1, length, 40)

if not ir_smodel_path.exists():
    ir_smodel = ov.convert_model(smodel, example_input=dummy_input)
    ov.save_model(ir_smodel, ir_smodel_path)
else:
    ir_smodel = core.read_model(ir_smodel_path)

To prepare the input for inference, define helper functions based on the methods of the speaker_encoder.voice_encoder.SpeakerEncoder class.

from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames
from speaker_encoder import audio


def compute_partial_slices(n_samples: int, rate, min_coverage):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.

    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.

    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1

    # Compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
    assert 0 < frame_step, "The rate is too high"
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % \
        (sampling_rate / (samples_per_frame * partials_n_frames))

    # Compute the slices
    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_range = np.array([i, i + partials_n_frames])
        wav_range = mel_range * samples_per_frame
        mel_slices.append(slice(*mel_range))
        wav_slices.append(slice(*wav_range))

    # Evaluate whether extra padding is warranted or not
    last_wav_range = wav_slices[-1]
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
    if coverage < min_coverage and len(mel_slices) > 1:
        mel_slices = mel_slices[:-1]
        wav_slices = wav_slices[:-1]

    return wav_slices, mel_slices


def embed_utterance(wav: np.ndarray, smodel: ov.CompiledModel, return_partials=False, rate=1.3, min_coverage=0.75):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.

    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # Compute where to split the utterance into partials and pad the waveform with zeros if
    # the partial utterances cover a larger range.
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage)
    max_wave_length = wav_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

    # Split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav)
    mels = np.array([mel[s] for s in mel_slices])
    with torch.no_grad():
        mels = torch.from_numpy(mels).to(torch.device('cpu'))
        output_layer = smodel.output(0)
        partial_embeds = smodel(mels)[output_layer]

    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)

    if return_partials:
        return embed, partial_embeds, wav_slices
    return embed
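
To make the slicing arithmetic in compute_partial_slices concrete, here is a pure-Python mirror of it applied to a 2-second waveform. The function name `partial_slices` is hypothetical, and the default values `sampling_rate=16000`, `mel_window_step=10` (ms), and `partials_n_frames=160` are assumptions chosen to match the docstring's statement that partial utterances are 1.6 s long. Check speaker_encoder/hparams.py for the actual values.

```python
import math


def partial_slices(n_samples, rate=1.3, min_coverage=0.75,
                   sampling_rate=16000, mel_window_step=10, partials_n_frames=160):
    """Mirror of the slicing arithmetic; returns (wav_slices, mel_slices) as (start, stop) tuples."""
    samples_per_frame = sampling_rate * mel_window_step // 1000
    n_frames = math.ceil((n_samples + 1) / samples_per_frame)
    frame_step = round((sampling_rate / rate) / samples_per_frame)

    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_slices.append((i, i + partials_n_frames))
        wav_slices.append((i * samples_per_frame, (i + partials_n_frames) * samples_per_frame))

    # Drop the last partial if too little of it is covered by real audio.
    start, stop = wav_slices[-1]
    coverage = (n_samples - start) / (stop - start)
    if coverage < min_coverage and len(mel_slices) > 1:
        wav_slices, mel_slices = wav_slices[:-1], mel_slices[:-1]
    return wav_slices, mel_slices


wav_slices, mel_slices = partial_slices(32000)  # 2 seconds at 16 kHz
print(mel_slices)  # [(0, 160), (77, 237)]
print(wav_slices)  # [(0, 25600), (12320, 37920)]
```

With these assumed values, the second partial covers 19680 of its 25600 samples (about 0.77), which is above min_coverage, so it is kept. The waveform would then be zero-padded from 32000 to 37920 samples before the mel spectrogram is computed.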

Select the device for running inference with OpenVINO from the dropdown list.

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Then compile the model.

compiled_smodel = core.compile_model(ir_smodel, device.value)

In the same way, export the SynthesizerTrn model, which implements the decoder functionality, to OpenVINO IR format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "net_g"
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")

dummy_input_1 = torch.randn(1, 1024, 81)
dummy_input_2 = torch.randn(1, 256)

# define forward as infer
net_g.forward = net_g.infer


if not ir_net_g_path.exists():
    ir_net_g_model = ov.convert_model(net_g, example_input=(dummy_input_1, dummy_input_2))
    ov.save_model(ir_net_g_model, ir_net_g_path)
else:
    ir_net_g_model = core.read_model(ir_net_g_path)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 25920 / 25920 (100.0%)
Greatest absolute difference: 1.6649601459503174 at index (0, 0, 25248) (up to 1e-05 allowed)
Greatest relative difference: 12715.998370804822 at index (0, 0, 5088) (up to 1e-05 allowed)
  _check_trace(

Compile the model for the device selected from the dropdown list for OpenVINO inference.

compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)

Define a function for synthesis.

def synthesize_audio(src, tgt):
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

    g_tgt = embed_utterance(wav_tgt, compiled_smodel)
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

    # src
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
    wav_src = np.expand_dims(wav_src, axis=0)

    output_layer = compiled_cmodel.output(0)
    c = compiled_cmodel(wav_src)[output_layer]
    c = c.transpose((0, 2, 1))

    output_layer = compiled_ir_net_g_model.output(0)
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer]
    tgt_audio = tgt_audio[0][0]

    return tgt_audio
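
The transpose inside synthesize_audio reorders the content features before they reach the decoder. The shape-only sketch below illustrates that step with zero-filled stand-ins. The (batch, frames, features) layout of the content-encoder output is an inference from the transpose and from the decoder's dummy inputs (1, 1024, 81) and (1, 256), not something the notebook states.

```python
import numpy as np

# Stand-in for the content-encoder output, assumed to be (batch, frames, features).
# 81 frames and 1024 features mirror the decoder's dummy input (1, 1024, 81).
c = np.zeros((1, 81, 1024), dtype=np.float32)

# Swap the last two axes so that the feature axis comes before the time axis.
c_t = c.transpose((0, 2, 1))

# Stand-in for the speaker embedding, matching the second dummy input (1, 256).
g = np.zeros((1, 256), dtype=np.float32)

print(c_t.shape, g.shape)
```

After the transpose, both arrays have the layouts that were used as example inputs when the decoder was converted.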

Now inference can be checked using only the IR models.

result_wav_names = []

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line

        output_audio = synthesize_audio(src, tgt)

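        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        # Add the source and target stems plus an "_ir" suffix so that results written
        # within the same minute do not overwrite each other or the PyTorch outputs
        result_name = f'{timestamp}_{Path(src).stem}_to_{Path(tgt).stem}_ir.wav'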
        result_wav_names.append(result_name)
        write(
            os.path.join('outputs/freevc', result_name),
            hps.data.sampling_rate,
            output_audio
        )
0it [00:00, ?it/s]
1it [00:00,  1.34it/s]
2it [00:02,  1.29s/it]
2it [00:02,  1.21s/it]

The resulting audio files are available in ‘outputs/freevc’, where you can listen to them and compare them with the ones generated earlier. One of the results is shown below.

Source audio (source of the text):

import IPython.display as ipd
ipd.Audio(srcs[0])

Target audio (source of the voice):

ipd.Audio(tgts[0])

Result audio:

ipd.Audio(f'outputs/freevc/{result_wav_names[0]}')

You can also use your own audio files. Just upload them and use them for inference. Use a sampling rate that matches the value of hps.data.sampling_rate.

import gradio as gr


audio1 = gr.Audio(label="Source Audio", type='filepath')
audio2 = gr.Audio(label="Reference Audio", type='filepath')
outputs = gr.Audio(label="Output Audio", type='filepath')
examples = [[audio1_name, audio2_name]]

title = 'FreeVC with Gradio'
description = 'Gradio Demo for FreeVC and OpenVINO™. Upload a source audio and a reference audio, then click the "Submit" button to inference.'


def infer(src, tgt):
    output_audio = synthesize_audio(src, tgt)

    timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
    result_name = f'{timestamp}.wav'
    write(result_name, hps.data.sampling_rate, output_audio)

    return result_name


iface = gr.Interface(infer, [audio1, audio2], outputs, title=title, description=description, examples=examples)
iface.launch()
# if you are launching remotely, specify server_name and server_port
# iface.launch(server_name='your server name', server_port='server port in int')
# if you have any issue to launch on your platform, you can pass share=True to launch method:
# iface.launch(share=True)
# it creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().
iface.close()
Closing server running on port: 7860
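
The earlier note about using audio whose rate matches hps.data.sampling_rate can be checked before uploading a file. The sketch below uses the standard-library wave module, which reads only PCM WAV files. The helper names are hypothetical, and the default of 16000 Hz is an assumption, so pass hps.data.sampling_rate explicitly when it is available.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate stored in the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def has_expected_rate(path, expected_rate=16000):
    """Return True if the file's rate equals expected_rate (16000 is an assumed default)."""
    return wav_sample_rate(path) == expected_rate
```

For example, has_expected_rate('p225_001.wav', hps.data.sampling_rate) would check one of the sample files downloaded earlier, assuming it is a PCM WAV file.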