High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™#

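This Jupyter notebook can be launched only after a local installation.

FreeVC converts the voice of a source speaker to a target style without text annotation and without changing the linguistic content.

The figure below shows the FreeVC model architecture for inference. This notebook focuses on the inference part only. There are three main parts: the Prior Encoder, the Speaker Encoder, and the Decoder. The Prior Encoder contains a WavLM model, a bottleneck extractor, and a normalizing flow. For details, refer to the FreeVC paper.

Image source

The upstream FreeVC project suggests using a command-line interface and CUDA only. This notebook shows how to use FreeVC in Python without a CUDA device. It consists of the following steps: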

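Download and prepare the models.
Run inference.
Convert the models to OpenVINO Intermediate Representation.
Run inference using only the OpenVINO IR models.

Table of contents:

Prerequisites
Imports and settings
Convert models to OpenVINO Intermediate Representation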

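Prerequisites#

These steps can be done manually, or they are performed automatically while the notebook runs, but only to the minimum extent required.

Clone the repository: git clone https://github.com/OlaWod/FreeVC.git.
Download WavLM-Large and put it under the FreeVC/wavlm/ directory.
Optionally, download the VCTK dataset. For this example, only two audio files are downloaded from the Hugging Face FreeVC example.
Download the pretrained models and put them under the 'checkpoints' directory (for this example, only freevc.pth is required).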
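
The checklist above can be verified before running the rest of the notebook. The following is a minimal sketch and not part of the original notebook: `missing_prerequisites` is a hypothetical helper, and the paths it checks are the ones used later in this notebook (the FreeVC clone, the WavLM-Large checkpoint, and freevc.pth).

```python
from pathlib import Path

# Paths used later in this notebook; adjust them if your layout differs.
REQUIRED_PATHS = [
    "FreeVC",
    "FreeVC/wavlm/WavLM-Large.pt",
    "checkpoints/freevc.pth",
]


def missing_prerequisites(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


for path in missing_prerequisites():
    print(f"missing: {path}")
```

An empty result means nothing needs to be fetched. Otherwise, the cells below download the missing pieces.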

Install the extra requirements

%pip install -q "openvino>=2023.3.0" "librosa>=0.8.1" "webrtcvad==2.0.10" "gradio>=4.19" "torch>=2.1" gdown scipy tqdm torchvision --extra-index-url https://download.pytorch.org/whl/cpu

Note: you may need to restart the kernel to use updated packages.

Check whether FreeVC is installed and append its path to sys.path

from pathlib import Path 
import sys 

free_vc_repo = "FreeVC" 
if not Path(free_vc_repo).exists():
    !git clone https://github.com/OlaWod/FreeVC.git

sys.path.append(free_vc_repo)

Cloning into 'FreeVC'... 
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 131 (delta 39), reused 24 (delta 24), pack-reused 66
Receiving objects: 100% (131/131), 15.28 MiB | 25.91 MiB/s, done.
Resolving deltas: 100% (43/43), done.
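
Re-running the cell above appends the same entry to sys.path again. That is harmless, but a small guard keeps the list clean. A minimal sketch: `add_to_sys_path` is a hypothetical helper, not part of FreeVC or the notebook utilities.

```python
import sys
from pathlib import Path


def add_to_sys_path(path):
    """Append `path` to sys.path once; return True if it was added."""
    entry = str(Path(path))
    if entry in sys.path:
        return False
    sys.path.append(entry)
    return True


add_to_sys_path("FreeVC")  # first call adds the entry
add_to_sys_path("FreeVC")  # second call is a no-op
```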

# Fetch `notebook_utils` module
import requests
import gdown

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
r.raise_for_status()  # stop here if the download did not succeed

with open("notebook_utils.py", "w") as f:
    f.write(r.text)
from notebook_utils import download_file

wavlm_large_dir_path = Path("FreeVC/wavlm") 
wavlm_large_path = wavlm_large_dir_path / "WavLM-Large.pt" 

wavlm_url = "https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4" 

if not wavlm_large_path.exists(): 
    gdown.download(wavlm_url, str(wavlm_large_path))

Downloading...
From: https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4
To: /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/WavLM-Large.pt 
100%|██████████| 1.26G/1.26G [00:11<00:00, 108MB/s]

freevc_chpt_dir = Path("checkpoints") 
freevc_chpt_name = "freevc.pth" 
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name 

if not freevc_chpt_path.exists():
    download_file(
        f"https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}",
        directory=freevc_chpt_dir,
    )

checkpoints/freevc.pth: 0%|          | 0.00/451M [00:00<?, ?B/s]

audio1_name = "p225_001.wav" 
audio1_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}" 
audio2_name = "p226_002.wav" 
audio2_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}" 

if not Path(audio1_name).exists(): 
    download_file(audio1_url) 

if not Path(audio2_name).exists(): 
    download_file(audio2_url)

p225_001.wav: 0%|          | 0.00/50.8k [00:00<?, ?B/s]

p226_002.wav: 0%|          | 0.00/135k [00:00<?, ?B/s]
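
The download cells above all follow the same "fetch only if the file is missing" pattern. A minimal sketch of that pattern as a reusable function: `ensure_file` and its `fetch` callback are hypothetical names, and the callback stands in for calls such as gdown.download or download_file.

```python
from pathlib import Path


def ensure_file(path, fetch):
    """Call `fetch(path)` only if `path` does not exist yet; return the path."""
    path = Path(path)
    if not path.exists():
        # Create the parent directory first so the fetcher can write into it.
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path
```

For example, `ensure_file(wavlm_large_path, lambda p: gdown.download(wavlm_url, str(p)))` would mirror the WavLM download cell.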

Imports and settings#

import logging 
import os 
import time 

import librosa 
import numpy as np 
import torch 
from scipy.io.wavfile import write 
from tqdm import tqdm 

import openvino as ov 

import utils 
from models import SynthesizerTrn 
from speaker_encoder.voice_encoder import SpeakerEncoder 
from wavlm import WavLM, WavLMConfig 

logger = logging.getLogger() 
logger.setLevel(logging.CRITICAL)

Redefine the get_cmodel function from utils to exclude CUDA.

def get_cmodel(): 
    checkpoint = torch.load(wavlm_large_path, map_location="cpu")  # load the weights onto the CPU
    cfg = WavLMConfig(checkpoint["cfg"]) 
    cmodel = WavLM(cfg) 
    cmodel.load_state_dict(checkpoint["model"]) 
    cmodel.eval() 

    return cmodel

Initialize the models

hps = utils.get_hparams_from_file("FreeVC/configs/freevc.json") 
os.makedirs("outputs/freevc", exist_ok=True) 

net_g = SynthesizerTrn(hps.data.filter_length // 2 + 1, hps.train.segment_size // hps.data.hop_length, **hps.model) 

utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True) 
cmodel = get_cmodel() 
smodel = SpeakerEncoder("FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt", device="cpu")

/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm. 
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")

Loaded the voice encoder model on cpu in 0.01 seconds.

Read the dataset settings

srcs = [audio1_name, audio2_name] 
tgts = [audio2_name, audio1_name]

Inference

with torch.no_grad(): 
    for i, (src, tgt) in enumerate(tqdm(zip(srcs, tgts))):
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)

        c = utils.get_content(cmodel, wav_src)

        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        write(
            # the timestamp has minute resolution, so the index keeps the two file names distinct
            os.path.join("outputs/freevc", "{}_{}.wav".format(timestamp, i)),
            hps.data.sampling_rate, 
            tgt_audio, 
        )

2it [00:03, 1.55s/it]

The resulting audio files are available in 'outputs/freevc'
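
File names built from a minute-resolution timestamp such as %m-%d_%H-%M can collide when several files are written within the same minute, and a later write then replaces an earlier one. A minimal sketch of a naming helper that avoids this by appending a counter: `unique_wav_path` is a hypothetical helper, not part of the notebook.

```python
import time
from pathlib import Path


def unique_wav_path(out_dir, stem=None):
    """Return a .wav path in `out_dir` that does not exist yet (hypothetical helper)."""
    out_dir = Path(out_dir)
    if stem is None:
        # Same minute-resolution format as the notebook uses.
        stem = time.strftime("%m-%d_%H-%M", time.localtime())
    candidate = out_dir / f"{stem}.wav"
    counter = 1
    while candidate.exists():
        # Append a counter until the name is free.
        candidate = out_dir / f"{stem}_{counter}.wav"
        counter += 1
    return candidate
```

The returned path can be passed to scipy.io.wavfile.write in place of the hand-built name.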

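Convert models to OpenVINO Intermediate Representation#

Convert each model to OpenVINO IR with FP16 precision (ov.save_model compresses weights to FP16 by default). The ov.convert_model function accepts the original PyTorch model object and an example input for tracing, and returns an instance of the OpenVINO Model class that represents the model. The converted model is ready to use: it can be loaded on a device with compile_model or saved to disk with ov.save_model. The read_model method loads a saved model from disk. For more information, refer to the OpenVINO model conversion documentation.

First, convert the WavLM model, which is part of the Prior Encoder, to OpenVINO IR format. The model's original name, cmodel, is kept in the code.

# define forward as extract_features for compatibility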
cmodel.forward = cmodel.extract_features

OUTPUT_DIR = Path("output") 
BASE_MODEL_NAME = "cmodel" 

OUTPUT_DIR.mkdir(exist_ok=True) 

ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml") 

length = 32000 

dummy_input = torch.randn(1, length)

Convert to OpenVINO IR format.

core = ov.Core() 

class ModelWrapper(torch.nn.Module): 
    def __init__(self, model): 
        super().__init__() 
        self.model = model 

    def forward(self, input): 
        return self.model(input)[0] 

if not ir_cmodel_path.exists(): 
    ir_cmodel = ov.convert_model(ModelWrapper(cmodel), example_input=dummy_input) 
    ov.save_model(ir_cmodel, ir_cmodel_path) 
else: 
    ir_cmodel = core.read_model(ir_cmodel_path)

/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert src_len, bsz == value.shape[:2]

Select a device from the drop-down list to run inference with OpenVINO.

import ipywidgets as widgets 

device = widgets.Dropdown( 
    options=core.available_devices + ["AUTO"], 
    value="AUTO", 
    description="Device:", 
    disabled=False, 
) 

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

compiled_cmodel = core.compile_model(ir_cmodel, device.value)
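
The drop-down widget needs a Jupyter front end. When the same code runs as a plain script, the choice can be made in code instead. A minimal sketch, assuming the device list comes from core.available_devices as in the cell above: `pick_device` is a hypothetical helper.

```python
def pick_device(available, preferred="AUTO"):
    """Return `preferred` if it is a valid choice, otherwise fall back to "AUTO".

    `available` is a list of device names such as core.available_devices.
    "AUTO" is always accepted, mirroring the options of the drop-down above.
    """
    options = list(available) + ["AUTO"]
    return preferred if preferred in options else "AUTO"


print(pick_device(["CPU"], "GPU"))  # GPU is not in the list, so this prints AUTO
print(pick_device(["CPU"], "CPU"))  # prints CPU
```

The returned string can be passed to core.compile_model in place of device.value.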

Next, convert the speaker encoder (smodel) to OpenVINO IR format in the same way.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel" 

OUTPUT_DIR.mkdir(exist_ok=True) 

ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml") 

length = 32000 

dummy_input = torch.randn(1, length, 40) 

if not ir_smodel_path.exists(): 
    ir_smodel = ov.convert_model(smodel, example_input=dummy_input)
    ov.save_model(ir_smodel, ir_smodel_path)
else: 
    ir_smodel = core.read_model(ir_smodel_path)

To prepare inputs for inference, helper functions must be defined based on the methods of the speaker_encoder.voice_encoder.SpeakerEncoder class.

from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames 
from speaker_encoder import audio 

def compute_partial_slices(n_samples: int, rate, min_coverage): 
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.

    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.

    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1 

    # compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000)) 
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame)) 
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame)) 
    assert 0 < frame_step, "The rate is too high" 
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % (sampling_rate / (samples_per_frame * partials_n_frames)) 

    # compute the slices
    wav_slices, mel_slices = [], [] 
    steps = max(1, n_frames - partials_n_frames + frame_step + 1) 
    for i in range(0, steps, frame_step): 
        mel_range = np.array([i, i + partials_n_frames]) 
        wav_range = mel_range * samples_per_frame 
        mel_slices.append(slice(*mel_range)) 
        wav_slices.append(slice(*wav_range)) 

    # evaluate whether extra padding is warranted
    last_wav_range = wav_slices[-1] 
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start) 
    if coverage < min_coverage and len(mel_slices) > 1: 
        mel_slices = mel_slices[:-1] 
        wav_slices = wav_slices[:-1] 

    return wav_slices, mel_slices 

def embed_utterance( 
    wav: np.ndarray, 
    smodel: ov.CompiledModel, 
    return_partials=False,
    rate=1.3,
    min_coverage=0.75,
):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.

    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # compute where to split the utterance into partials and pad the waveform with zeros
    # if the partial utterances cover a larger range
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage) 
    max_wave_length = wav_slices[-1].stop 
    if max_wave_length >= len(wav): 
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant") 

    # split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav) 
    mels = np.array([mel[s] for s in mel_slices]) 
    with torch.no_grad(): 
        mels = torch.from_numpy(mels).to(torch.device("cpu")) 
        output_layer = smodel.output(0) 
        partial_embeds = smodel(mels)[output_layer] 

    # compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0) 
    embed = raw_embed / np.linalg.norm(raw_embed, 2) 

    if return_partials: 
        return embed, partial_embeds, wav_slices 
    return embed
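
To make the slicing arithmetic concrete, the sketch below re-implements the logic of compute_partial_slices in plain Python and runs it on a 2-second waveform. The constants are assumptions: a sampling rate of 16000 Hz, a mel window step of 10 ms, and 160 frames per partial are consistent with the 1.6 s partial length and the 0.625 minimum rate quoted in the docstring, but the real values come from speaker_encoder.hparams. `partial_slices` is a hypothetical name, and it returns (start, stop) tuples instead of slice objects.

```python
import math

# Assumed values; the notebook imports the real ones from speaker_encoder.hparams.
SAMPLING_RATE = 16000      # samples per second
MEL_WINDOW_STEP = 10       # milliseconds between mel frames
PARTIALS_N_FRAMES = 160    # mel frames per partial utterance (1.6 s)


def partial_slices(n_samples, rate=1.3, min_coverage=0.75):
    """Plain-Python version of the slicing logic; returns (start, stop) tuples."""
    samples_per_frame = SAMPLING_RATE * MEL_WINDOW_STEP // 1000
    n_frames = math.ceil((n_samples + 1) / samples_per_frame)
    frame_step = round((SAMPLING_RATE / rate) / samples_per_frame)

    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - PARTIALS_N_FRAMES + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_slices.append((i, i + PARTIALS_N_FRAMES))
        wav_slices.append((i * samples_per_frame, (i + PARTIALS_N_FRAMES) * samples_per_frame))

    # Drop the last slice if too little of it is covered by real audio.
    start, stop = wav_slices[-1]
    coverage = (n_samples - start) / (stop - start)
    if coverage < min_coverage and len(mel_slices) > 1:
        wav_slices, mel_slices = wav_slices[:-1], mel_slices[:-1]
    return wav_slices, mel_slices


wav_slices, mel_slices = partial_slices(32000)  # 2 s of audio at the assumed 16 kHz
print(mel_slices)  # [(0, 160), (77, 237)]
print(wav_slices)  # [(0, 25600), (12320, 37920)]
```

Under these assumptions the last waveform slice ends at sample 37920, beyond the 32000 samples available, which is why embed_utterance zero-pads the waveform before computing the mel spectrogram.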

Select a device from the drop-down list to run inference with OpenVINO.

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Then compile the model.

compiled_smodel = core.compile_model(ir_smodel, device.value)

In the same way, export the SynthesizerTrn model, which implements the decoder, to OpenVINO IR format.

OUTPUT_DIR = Path("output") 
BASE_MODEL_NAME = "net_g" 
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx") 
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml") 

dummy_input_1 = torch.randn(1, 1024, 81) 
dummy_input_2 = torch.randn(1, 256) 

# define forward as infer
net_g.forward = net_g.infer 

if not ir_net_g_path.exists(): 
    ir_net_g_model = ov.convert_model(net_g, example_input=(dummy_input_1, dummy_input_2)) 
    ov.save_model(ir_net_g_model, ir_net_g_path) 
else: 
    ir_net_g_model = core.read_model(ir_net_g_path)

/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-727/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1116: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error: Tensor-likes are not close!
 Mismatched elements: 25671 / 25920 (99.0%) 
Greatest absolute difference: 0.025559324771165848 at index (0, 0, 3076) (up to 1e-05 allowed) 
Greatest relative difference: 26733.040729741195 at index (0, 0, 14200) (up to 1e-05 allowed) 
  _check_trace(

Compile the model for the device selected in the drop-down list above.

compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)

Define a function for synthesis.

def synthesize_audio(src, tgt): 
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate) 
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20) 

    g_tgt = embed_utterance(wav_tgt, compiled_smodel) 
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0) 

    # src 
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate) 
    wav_src = np.expand_dims(wav_src, axis=0) 

    output_layer = compiled_cmodel.output(0) 
    c = compiled_cmodel(wav_src)[output_layer] 
    c = c.transpose((0, 2, 1)) 

    output_layer = compiled_ir_net_g_model.output(0) 
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer] 
    tgt_audio = tgt_audio[0][0] 

    return tgt_audio
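
The transpose in synthesize_audio swaps the last two axes of the content features. A minimal sketch with random data, assuming the WavLM output is laid out as (batch, frames, channels) with 1024 channels. This matches the (1, 1024, 81) example input used above to convert the decoder, and the frame count of 81 is taken from that example input.

```python
import numpy as np

# Stand-in for the compiled WavLM output: (batch, frames, channels).
c = np.random.rand(1, 81, 1024).astype(np.float32)

# The decoder was traced with (batch, channels, frames), so swap the last two axes.
c_t = c.transpose((0, 2, 1))

print(c.shape, "->", c_t.shape)  # (1, 81, 1024) -> (1, 1024, 81)
```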

Now inference can be checked using only the IR models.

result_wav_names = [] 

with torch.no_grad(): 
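    for i, (src, tgt) in enumerate(tqdm(zip(srcs, tgts))):
        output_audio = synthesize_audio(src, tgt)

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        # the timestamp has minute resolution; the index and "_ov" suffix keep names distinct
        # from each other and from the files written by the PyTorch run above
        result_name = f"{timestamp}_{i}_ov.wav"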
        result_wav_names.append(result_name) 
        write( 
            os.path.join("outputs/freevc", result_name), 
            hps.data.sampling_rate, 
            output_audio, 
        )

2it [00:01, 1.23it/s]

The resulting audio files are available in 'outputs/freevc', where they can be reviewed and compared with the ones generated earlier. One of the results is shown below.

Source audio (source of text):

import IPython.display as ipd 

ipd.Audio(srcs[0])

Target audio (source of voice):

ipd.Audio(tgts[0])

Result audio:

ipd.Audio(f"outputs/freevc/{result_wav_names[0]}")

You can also use your own audio files. Just upload them and use them for inference. Use a sampling rate that matches the value of hps.data.sampling_rate.

import gradio as gr 

audio1 = gr.Audio(label="Source Audio", type="filepath") 
audio2 = gr.Audio(label="Reference Audio", type="filepath") 
outputs = gr.Audio(label="Output Audio", type="filepath") 
examples = [[audio1_name, audio2_name]] 

title = "FreeVC with Gradio" 
description = 'Gradio Demo for FreeVC and OpenVINO™. Upload a source audio and a reference audio, then click the "Submit" button to run inference.'

def infer(src, tgt): 
    output_audio = synthesize_audio(src, tgt) 

    timestamp = time.strftime("%m-%d_%H-%M", time.localtime()) 
    result_name = f"{timestamp}.wav" 
    write(result_name, hps.data.sampling_rate, output_audio) 

    return result_name 

iface = gr.Interface( 
    infer, 
    [audio1, audio2], 
    outputs, 
    title=title, 
    description=description, 
    examples=examples, 
) 
iface.launch() 
# To launch remotely, specify server_name and server_port
# iface.launch(server_name='your server name', server_port='server port in int')
# If you have trouble launching on your platform, you can pass share=True to the launch method:
# iface.launch(share=True)
# This creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/

Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().

iface.close()

Closing server running on port: 7860
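
Before uploading your own recordings, it can help to confirm their sample rate. A minimal sketch using the standard-library wave module, which reads uncompressed PCM WAV files only: `wav_sample_rate` is a hypothetical helper, and 16000 is an assumed target used for illustration. In the notebook, the reference value is hps.data.sampling_rate.

```python
import os
import struct
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Write a short silent 16-bit mono file so the example is self-contained.
expected_rate = 16000  # assumed target; compare against hps.data.sampling_rate in practice
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(expected_rate)
    wav_file.writeframes(struct.pack("<h", 0) * 160)

print(wav_sample_rate(demo_path) == expected_rate)  # True
```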