Voice tone cloning with OpenVoice and OpenVINO

This Jupyter notebook can be launched online, opening an interactive environment in a browser window. It can also be installed locally. Choose one of the following options:

Binder Google Colab GitHub

OpenVoice is a versatile instant voice tone transfer approach that uses only a short audio clip from the source speaker to transfer the voice tone and generate speech in different languages. OpenVoice has three main capabilities:

  1. High-quality tone color cloning across multiple languages and accents.
  2. Flexible control over voice style, including emotion, accent, and other parameters such as rhythm, pauses, and intonation.
  3. Zero-shot cross-lingual voice cloning, which removes the need for the generated or reference speech to be part of a massive-speaker multilingual training dataset.
image

More details about the model are available on the project web page, in the paper, and in the official repository.

This notebook provides an example of converting the PyTorch OpenVoice models to OpenVINO IR. In this tutorial we explore how to convert and run OpenVoice using OpenVINO.
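
As a bird's-eye view of what follows, the general PyTorch-to-OpenVINO conversion flow used in this notebook looks roughly like the sketch below. This is a minimal illustration with a hypothetical toy module, not one of the OpenVoice models:

import torch
import openvino as ov

# Hypothetical toy module, used only to illustrate the conversion flow.
class ToyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

example_input = torch.randn(1, 8)
# Trace the module with an example input and convert it to an OpenVINO model.
ov_model = ov.convert_model(ToyModel(), example_input=example_input)
# Save the IR (an .xml file plus a .bin file with the weights) for later reuse.
ov.save_model(ov_model, "toy_model.xml")
# Compile for a target device and run inference; outputs are accessed by index.
compiled = ov.Core().compile_model(ov_model, "CPU")
result = compiled(example_input)[0]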

Table of contents

Clone the repository and install requirements

import sys
from pathlib import Path

repo_dir = Path("OpenVoice")

if not repo_dir.exists():
    !git clone https://github.com/myshell-ai/OpenVoice

# append to sys.path so that modules from the repo could be imported
sys.path.append(str(repo_dir))

%pip install -q \
"librosa>=0.8.1" \
"wavmark>=0.0.3" \
"faster-whisper>=0.9.0" \
"pydub>=0.25.1" \
"whisper-timestamped>=1.14.2" \
"tqdm" \
"inflect>=7.0.0" \
"unidecode>=1.3.7" \
"eng_to_ipa>=0.0.2" \
"pypinyin>=0.50.0" \
"cn2an>=0.5.22" \
"jieba>=0.42.1" \
"langid>=1.1.6" \
"gradio>=4.15" \
"ipywebrtc" \
"ffmpeg-downloader"

!ffdl install -y
Cloning into 'OpenVoice'...

remote: Enumerating objects: 317, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 317 (delta 99), reused 89 (delta 83), pack-reused 182
Receiving objects: 100% (317/317), 2.90 MiB | 3.26 MiB/s, done.
Resolving deltas: 100% (152/152), done.
DEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: ffmpeg==6.1 in /opt/home/k8sworker/.local/share/ffmpeg-downloader/ffmpeg

Download checkpoints and load the PyTorch models

import os
import torch
import openvino as ov
import ipywidgets as widgets
from IPython.display import Audio

core = ov.Core()

from api import BaseSpeakerTTS, ToneColorConverter, OpenVoiceBaseClass
import se_extractor
Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.
CKPT_BASE_PATH = 'checkpoints'

en_suffix = f'{CKPT_BASE_PATH}/base_speakers/EN'
zh_suffix = f'{CKPT_BASE_PATH}/base_speakers/ZH'
converter_suffix = f'{CKPT_BASE_PATH}/converter'

To keep the notebook lightweight by default, the Chinese speech model is not activated. To turn it on, set the enable_chinese_lang flag to True.

enable_chinese_lang = False
def download_from_hf_hub(filename, local_dir='./'):
    from huggingface_hub import hf_hub_download
    os.makedirs(local_dir, exist_ok=True)
    hf_hub_download(repo_id="myshell-ai/OpenVoice", filename=filename, local_dir=local_dir)

download_from_hf_hub(f'{converter_suffix}/checkpoint.pth')
download_from_hf_hub(f'{converter_suffix}/config.json')
download_from_hf_hub(f'{en_suffix}/checkpoint.pth')
download_from_hf_hub(f'{en_suffix}/config.json')

download_from_hf_hub(f'{en_suffix}/en_default_se.pth')
download_from_hf_hub(f'{en_suffix}/en_style_se.pth')

if enable_chinese_lang:
    download_from_hf_hub(f'{zh_suffix}/checkpoint.pth')
    download_from_hf_hub(f'{zh_suffix}/config.json')
    download_from_hf_hub(f'{zh_suffix}/zh_default_se.pth')
pt_device = "cpu"

en_base_speaker_tts = BaseSpeakerTTS(f'{en_suffix}/config.json', device=pt_device)
en_base_speaker_tts.load_ckpt(f'{en_suffix}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{converter_suffix}/config.json', device=pt_device)
tone_color_converter.load_ckpt(f'{converter_suffix}/checkpoint.pth')

if enable_chinese_lang:
    zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_suffix}/config.json', device=pt_device)
    zh_base_speaker_tts.load_ckpt(f'{zh_suffix}/checkpoint.pth')
else:
    zh_base_speaker_tts = None
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loaded checkpoint 'checkpoints/base_speakers/EN/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint 'checkpoints/converter/checkpoint.pth'
missing/unexpected keys: [] []

Convert the models to OpenVINO IR

OpenVoice consists of two models. The first one, BaseSpeakerTTS, is responsible for speech generation; the second one, ToneColorConverter, applies an arbitrary voice tone to the original speech. To convert them to OpenVINO IR format, we first need suitable torch.nn.Module objects. Since ToneColorConverter and BaseSpeakerTTS both use custom infer and convert_voice methods, respectively, instead of self.forward as their main entry points, we need to wrap them in custom classes that inherit from torch.nn.Module.

class OVOpenVoiceBase(torch.nn.Module):
    """
    Base class for both the TTS and the voice tone conversion model: the constructor is the same for both of them.
    """
    def __init__(self, voice_model: OpenVoiceBaseClass):
        super().__init__()
        self.voice_model = voice_model
        for par in voice_model.model.parameters():
            par.requires_grad = False

class OVOpenVoiceTTS(OVOpenVoiceBase):
    """
    Constructor of this class accepts a BaseSpeakerTTS object for speech generation and wraps its 'infer' method with forward.
    """
    def get_example_input(self):
        stn_tst = self.voice_model.get_text('this is original text', self.voice_model.hps, False)
        x_tst = stn_tst.unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
        speaker_id = torch.LongTensor([1])
        noise_scale = torch.tensor(0.667)
        length_scale = torch.tensor(1.0)
        noise_scale_w = torch.tensor(0.6)
        return (x_tst, x_tst_lengths, speaker_id, noise_scale, length_scale, noise_scale_w)

    def forward(self, x, x_lengths, sid, noise_scale, length_scale, noise_scale_w):
        return self.voice_model.model.infer(x, x_lengths, sid, noise_scale, length_scale, noise_scale_w)

class OVOpenVoiceConverter(OVOpenVoiceBase):
    """
    Constructor of this class accepts a ToneColorConverter object for voice tone conversion and wraps its 'voice_conversion' method with forward.
    """
    def get_example_input(self):
        y = torch.randn([1, 513, 238], dtype=torch.float32)
        y_lengths = torch.LongTensor([y.size(-1)])
        target_se = torch.randn(*(1, 256, 1))
        source_se = torch.randn(*(1, 256, 1))
        tau = torch.tensor(0.3)
        return (y, y_lengths, source_se, target_se, tau)

    def forward(self, y, y_lengths, sid_src, sid_tgt, tau):
        return self.voice_model.model.voice_conversion(y, y_lengths, sid_src, sid_tgt, tau)

Convert the models to OpenVINO IR and save them to the IRS_PATH folder for future use. If an IR already exists, the conversion is skipped and the model is read directly.

IRS_PATH = 'openvino_irs/'
EN_TTS_IR = f'{IRS_PATH}/openvoice_en_tts.xml'
ZH_TTS_IR = f'{IRS_PATH}/openvoice_zh_tts.xml'
VOICE_CONVERTER_IR = f'{IRS_PATH}/openvoice_tone_conversion.xml'

paths = [EN_TTS_IR, VOICE_CONVERTER_IR]
models = [OVOpenVoiceTTS(en_base_speaker_tts), OVOpenVoiceConverter(tone_color_converter)]
if enable_chinese_lang:
    models.append(OVOpenVoiceTTS(zh_base_speaker_tts))
    paths.append(ZH_TTS_IR)
ov_models = []

for model, path in zip(models, paths):
    if not os.path.exists(path):
        ov_model = ov.convert_model(model, example_input=model.get_example_input())
        ov.save_model(ov_model, path)
    else:
        ov_model = core.read_model(path)
    ov_models.append(ov_model)

ov_en_tts, ov_voice_conversion = ov_models[:2]
if enable_chinese_lang:
    ov_zh_tts = ov_models[-1]
this is original text.
    length:22
    length:21
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/attentions.py:283: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert (
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/attentions.py:346: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  pad_length = max(length - (self.window_size + 1), 0)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/attentions.py:347: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  slice_start_position = max((self.window_size + 1) - length, 0)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/attentions.py:349: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if pad_length > 0:
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/transforms.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.min(inputs) < left or torch.max(inputs) > right:
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/transforms.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if min_bin_width * num_bins > 1.0:
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/transforms.py:121: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if min_bin_height * num_bins > 1.0:
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/transforms.py:171: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
                                        assert (discriminant >= 0).all()
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Trace had nondeterministic nodes. Did you forget call .eval() on your model? Nodes:
    %3293 : Float(1, 2, 43, strides=[86, 43, 1], requires_grad=0, device=cpu) = aten::randn(%3288, %3289, %3290, %3291, %3292) # /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/models.py:175:0
    %5559 : Float(1, 192, 152, strides=[29184, 1, 192], requires_grad=0, device=cpu) = aten::randn_like(%m_p, %5554, %5555, %5556, %5557, %5558) # /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/models.py:485:0
This may cause errors in trace checking. To disable trace checking, pass check_trace=False to torch.jit.trace()
  _check_trace(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
The values for attribute 'shape' do not match: torch.Size([1, 1, 38912]) != torch.Size([1, 1, 37888]).
  _check_trace(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 2. of the traced function does not match the corresponding output of the Python function. Detailed error:
The values for attribute 'shape' do not match: torch.Size([1, 1, 152, 43]) != torch.Size([1, 1, 148, 43]).
  _check_trace(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 3. of the traced function does not match the corresponding output of the Python function. Detailed error:
The values for attribute 'shape' do not match: torch.Size([1, 1, 152]) != torch.Size([1, 1, 148]).
  _check_trace(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Trace had nondeterministic nodes. Did you forget call .eval() on your model? Nodes:
    %1596 : Float(1, 192, 238, strides=[91392, 238, 1], requires_grad=0, device=cpu) = aten::randn_like(%m, %1591, %1592, %1593, %1594, %1595) # /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/notebooks/284-openvoice/OpenVoice/models.py:220:0
This may cause errors in trace checking. To disable trace checking, pass check_trace=False to torch.jit.trace()
  _check_trace(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 25299 / 60928 (41.5%)
Greatest absolute difference: 0.04428939474746585 at index (0, 0, 60360) (up to 1e-05 allowed)
Greatest relative difference: 4310.538032454361 at index (0, 0, 8534) (up to 1e-05 allowed)
  _check_trace(

Inference

Select inference device

core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Select reference tone

First, select the reference voice tone to which the generated text will be converted. Choose one of the existing samples, record your own with record_manually, or upload your own file with load_manually.

REFERENCE_VOICES_PATH = f'{repo_dir}/resources/'
reference_speakers = [
    *[path for path in os.listdir(REFERENCE_VOICES_PATH) if os.path.splitext(path)[-1] == '.mp3'],
    'record_manually',
    'load_manually',
]

ref_speaker = widgets.Dropdown(
    options=reference_speakers,
    value=reference_speakers[0],
    description="reference voice from which tone color will be copied",
    disabled=False,
)

ref_speaker
Dropdown(description='reference voice from which tone color will be copied', options=('demo_speaker2.mp3', 'de…
OUTPUT_DIR = 'outputs/'
os.makedirs(OUTPUT_DIR, exist_ok=True)
ref_speaker_path = f'{REFERENCE_VOICES_PATH}/{ref_speaker.value}'
allowed_audio_types = '.mp4,.mp3,.wav,.wma,.aac,.m4a,.m4b,.webm'

if ref_speaker.value == 'record_manually':
    ref_speaker_path = f'{OUTPUT_DIR}/custom_example_sample.webm'
    from ipywebrtc import AudioRecorder, CameraStream
    camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera, filename=ref_speaker_path, autosave=True)
    display(recorder)
elif ref_speaker.value == 'load_manually':
    upload_ref = widgets.FileUpload(accept=allowed_audio_types, multiple=False, description='Select audio with reference voice')
    display(upload_ref)

Before cloning its tone onto other speech, you can play back the reference voice sample.

def save_audio(voice_source: widgets.FileUpload, out_path: str):
    with open(out_path, 'wb') as output_file:
        assert len(voice_source.value) > 0, 'Please select audio file'
        output_file.write(voice_source.value[0]['content'])

if ref_speaker.value == 'load_manually':
    ref_speaker_path = f'{OUTPUT_DIR}/{upload_ref.value[0].name}'
    save_audio(upload_ref, ref_speaker_path)
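
To actually listen to the chosen reference sample at this point, you could add a small optional cell like the one below (an addition of this write-up, not part of the original notebook; manually recorded .webm files may first need conversion to a browser-playable format):

# Optional: play back the selected reference voice sample in the notebook.
from IPython.display import Audio
Audio(ref_speaker_path)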

Load the speaker embeddings.

# ffmpeg is needed to load mp3 and manually recorded webm files
import ffmpeg_downloader as ffdl
delimiter = ':' if sys.platform != 'win32' else ';'
os.environ['PATH'] = os.environ['PATH'] + f"{delimiter}{ffdl.ffmpeg_dir}"
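# Optional sanity check (an addition of this write-up, not part of the original notebook):
# confirm that the ffmpeg binary is now discoverable on PATH before extracting embeddings.
import shutil
assert shutil.which('ffmpeg') is not None, "ffmpeg not found on PATH; rerun '!ffdl install -y'"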
en_source_default_se = torch.load(f'{en_suffix}/en_default_se.pth')
en_source_style_se = torch.load(f'{en_suffix}/en_style_se.pth')
zh_source_se = torch.load(f'{zh_suffix}/zh_default_se.pth') if enable_chinese_lang else None

target_se, audio_name = se_extractor.get_se(ref_speaker_path, tone_color_converter, target_dir=OUTPUT_DIR, vad=True)  # ffmpeg must be installed
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/librosa/util/decorators.py:88: UserWarning: PySoundFile failed. Trying audioread instead.
  return f(*args, **kwargs)
[(0.0, 8.178), (9.326, 12.914), (13.262, 16.402), (16.654, 29.49225)]
after vad: dur = 27.743990929705216
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/functional.py:660: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:874.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

Replace the original inference methods of OpenVoiceBaseClass with optimized OpenVINO inference

There is pre- and post-processing that cannot be traced and offloaded to OpenVINO. Instead of rewriting it ourselves, we reuse the existing implementation: it is enough to replace the inference and voice conversion functions of OpenVoiceBaseClass so that the most computationally expensive parts run in OpenVINO.

def get_patched_infer(ov_model: ov.Model, device: str) -> callable:
    compiled_model = core.compile_model(ov_model, device)

    def infer_impl(x, x_lengths, sid, noise_scale, length_scale, noise_scale_w):
        ov_output = compiled_model((x, x_lengths, sid, noise_scale, length_scale, noise_scale_w))
        return (torch.tensor(ov_output[0]), )
    return infer_impl

def get_patched_voice_conversion(ov_model: ov.Model, device: str) -> callable:
    compiled_model = core.compile_model(ov_model, device)

    def voice_conversion_impl(y, y_lengths, sid_src, sid_tgt, tau):
        ov_output = compiled_model((y, y_lengths, sid_src, sid_tgt, tau))
        return (torch.tensor(ov_output[0]), )
    return voice_conversion_impl

en_base_speaker_tts.model.infer = get_patched_infer(ov_en_tts, device.value)
tone_color_converter.model.voice_conversion = get_patched_voice_conversion(ov_voice_conversion, device.value)
if enable_chinese_lang:
    zh_base_speaker_tts.model.infer = get_patched_infer(ov_zh_tts, device.value)

Run inference

voice_source = widgets.Dropdown(
    options=['use TTS', 'choose_manually'],
    value='use TTS',
    description="Voice source",
    disabled=False,
)

voice_source
Dropdown(description='Voice source', options=('use TTS', 'choose_manually'), value='use TTS')
if voice_source.value == 'choose_manually':
    upload_orig_voice = widgets.FileUpload(accept=allowed_audio_types, multiple=False, description='audio whose tone will be replaced')
    display(upload_orig_voice)
if voice_source.value == 'choose_manually':
    orig_voice_path = f'{OUTPUT_DIR}/{upload_orig_voice.value[0].name}'
    save_audio(upload_orig_voice, orig_voice_path)
    source_se, _ = se_extractor.get_se(orig_voice_path, tone_color_converter, target_dir=OUTPUT_DIR, vad=True)
else:
    text = """
    OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve
    a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing,
    recommendation systems, and many others.
    """
    source_se = en_source_default_se
    orig_voice_path = f'{OUTPUT_DIR}/tmp.wav'
    en_base_speaker_tts.tts(text, orig_voice_path, speaker='default', language='English')
 > Text splitted to sentences.
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision,
automatic speech recognition, natural language processing, recommendation systems, and many others.
 > ===========================
ˈoʊpən vino* toolkit* ɪz ə ˌkɑmpɹiˈhɛnsɪv toolkit* fəɹ kˈwɪkli dɪˈvɛləpɪŋ ˌæpləˈkeɪʃənz ənd səˈluʃənz ðət sɑɫv ə vəɹˈaɪəti əv tæsks ˌɪnˈkludɪŋ ˌɛmjəˈleɪʃən əv ˈjumən ˈvɪʒən,
 length:173
 length:173
ˌɔtəˈmætɪk spitʃ ˌɹɛkɪgˈnɪʃən, ˈnætʃəɹəɫ ˈlæŋgwɪdʒ ˈpɹɑsɛsɪŋ, ˌɹɛkəmənˈdeɪʃən ˈsɪstəmz, ənd ˈmɛni ˈəðəɹz.
 length:105
 length:105

Finally, run the voice tone conversion with the OpenVINO-optimized models.

tau_slider = widgets.FloatSlider(
    value=0.3,
    min=0.01,
    max=2.0,
    step=0.01,
    description='tau',
    disabled=False,
    readout_format='.2f',
)
tau_slider
FloatSlider(value=0.3, description='tau', max=2.0, min=0.01, step=0.01)
resulting_voice_path = f'{OUTPUT_DIR}/output_with_cloned_voice_tone.wav'

tone_color_converter.convert(
    audio_src_path=orig_voice_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=resulting_voice_path,
    tau=tau_slider.value,
    message="@MyShell")
Audio(orig_voice_path)
Audio(resulting_voice_path)

Run the OpenVoice Gradio online app

You can also run TTS and voice tone conversion online using a Gradio app.

import gradio as gr
import langid

supported_languages = ['zh', 'en']

def build_predict(output_dir, tone_color_converter, en_tts_model, zh_tts_model, en_source_default_se, en_source_style_se, zh_source_se):
    def predict(prompt, style, audio_file_pth, agree):
        return predict_impl(prompt, style, audio_file_pth, agree, output_dir, tone_color_converter, en_tts_model, zh_tts_model, en_source_default_se, en_source_style_se, zh_source_se)
    return predict

def predict_impl(prompt, style, audio_file_pth, agree, output_dir, tone_color_converter, en_tts_model, zh_tts_model, en_source_default_se, en_source_style_se, zh_source_se):
    text_hint = ''
    if not agree:
        text_hint += '[ERROR] Please accept the Terms & Condition!\n'
        gr.Warning("Please accept the Terms & Condition!")
        return (
            text_hint,
            None,
            None,
        )

    language_predicted = langid.classify(prompt)[0].strip()
    print(f"Detected language:{language_predicted}")

    if language_predicted not in supported_languages:
        text_hint += f"[ERROR] The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}\n"
        gr.Warning(
            f"The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}"
        )

        return (
            text_hint,
            None,
        )

    if language_predicted == "zh":
        tts_model = zh_tts_model
        if zh_tts_model is None:
            gr.Warning("TTS model for Chinece language was not loaded please set 'enable_chinese_lang=True`")
            return (
                text_hint,
                None,
            )
        source_se = zh_source_se
        language = 'Chinese'
        if style not in ['default']:
            text_hint += f"[ERROR] The style {style} is not supported for Chinese, which should be in ['default']\n"
            gr.Warning(f"The style {style} is not supported for Chinese, which should be in ['default']")
            return (
                text_hint,
                None,
            )

    else:
        tts_model = en_tts_model
        if style == 'default':
            source_se = en_source_default_se
        else:
            source_se = en_source_style_se
        language = 'English'
        supported_styles = ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']
        if style not in supported_styles:
            text_hint += f"[ERROR] The style {style} is not supported for English, which should be in {*supported_styles,}\n"
            gr.Warning(f"The style {style} is not supported for English, which should be in {*supported_styles,}")
            return (
                text_hint,
                None,
            )

    speaker_wav = audio_file_pth

    if len(prompt) < 2:
        text_hint += "[ERROR] Please give a longer prompt text \n"
        gr.Warning("Please give a longer prompt text")
        return (
            text_hint,
            None,
        )
    if len(prompt) > 200:
        text_hint += "[ERROR] Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo and try for your usage \n"
        gr.Warning(
            "Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo for your usage"
        )
        return (
            text_hint,
            None,
        )

    # note diffusion_conditioning not used on hifigan (default mode), it will be empty but need to pass it to model.inference
    try:
        target_se, audio_name = se_extractor.get_se(speaker_wav, tone_color_converter, target_dir=OUTPUT_DIR, vad=True)
    except Exception as e:
        text_hint += f"[ERROR] Get target tone color error {str(e)} \n"
        gr.Warning(
            "[ERROR] Get target tone color error {str(e)} \n"
        )
        return (
            text_hint,
            None,
        )

    src_path = f'{output_dir}/tmp.wav'
    tts_model.tts(prompt, src_path, speaker=style, language=language)

    save_path = f'{output_dir}/output.wav'
    encode_message = "@MyShell"
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=save_path,
        message=encode_message)

    text_hint += 'Get response successfully \n'

    return (
        text_hint,
        src_path,
        save_path,
    )

description = """
    # OpenVoice accelerated by OpenVINO:

    a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set.
"""

content = """
<div>
<strong>If the generated voice does not sound like the reference voice, please refer to <a href='https://github.com/myshell-ai/OpenVoice/blob/main/docs/QA.md'>this QnA</a>.</strong> <strong>For multi-lingual & cross-lingual examples, please refer to <a href='https://github.com/myshell-ai/OpenVoice/blob/main/demo_part2.ipynb'>this jupyter notebook</a>.</strong>
This online demo mainly supports <strong>English</strong>. The <em>default</em> style also supports <strong>Chinese</strong>. But OpenVoice can adapt to any other language as long as a base speaker is provided.
</div>
"""
wrapped_markdown_content = f"<div style='border: 1px solid #000; padding: 10px;'>{content}</div>"


examples = [
    [
        "今天天气真好,我们一起出去吃饭吧。",
        'default',
        "OpenVoice/resources/demo_speaker1.mp3",
        True,
    ],[
        "This audio is generated by open voice with a half-performance model.",
        'whispering',
        "OpenVoice/resources/demo_speaker2.mp3",
        True,
    ],
    [
        "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.",
        'sad',
        "OpenVoice/resources/demo_speaker0.mp3",
        True,
    ],
]

def get_demo(output_dir, tone_color_converter, en_tts_model, zh_tts_model, en_source_default_se, en_source_style_se, zh_source_se):
    with gr.Blocks(analytics_enabled=False) as demo:

        with gr.Row():
            gr.Markdown(description)
        with gr.Row():
            gr.HTML(wrapped_markdown_content)

        with gr.Row():
            with gr.Column():
                input_text_gr = gr.Textbox(
                    label="Text Prompt",
                    info="One or two sentences at a time is better. Up to 200 text characters.",
                    value="He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.",
                )
                style_gr = gr.Dropdown(
                    label="Style",
                    info="Select a style of output audio for the synthesised speech. (Chinese only support 'default' now)",
                    choices=['default', 'whispering', 'cheerful', 'terrified', 'angry', 'sad', 'friendly'],
                    max_choices=1,
                    value="default",
                )
                ref_gr = gr.Audio(
                    label="Reference Audio",
                    type="filepath",
                    value="OpenVoice/resources/demo_speaker2.mp3",
                )
                tos_gr = gr.Checkbox(
                    label="Agree",
                    value=False,
                    info="I agree to the terms of the cc-by-nc-4.0 license-: https://github.com/myshell-ai/OpenVoice/blob/main/LICENSE",
                )

                tts_button = gr.Button("Send", elem_id="send-btn", visible=True)


            with gr.Column():
                out_text_gr = gr.Text(label="Info")
                audio_orig_gr = gr.Audio(label="Synthesised Audio", autoplay=False)
                audio_gr = gr.Audio(label="Audio with cloned voice", autoplay=True)
                # ref_audio_gr = gr.Audio(label="Reference Audio Used")
                predict = build_predict(
                    output_dir,
                    tone_color_converter,
                    en_tts_model,
                    zh_tts_model,
                    en_source_default_se,
                    en_source_style_se,
                    zh_source_se
                )

                gr.Examples(examples,
                            label="Examples",
                            inputs=[input_text_gr, style_gr, ref_gr, tos_gr],
                            outputs=[out_text_gr, audio_gr],
                            fn=predict,
                            cache_examples=False,)
                tts_button.click(predict, [input_text_gr, style_gr, ref_gr, tos_gr], outputs=[out_text_gr, audio_orig_gr, audio_gr])
    return demo
demo = get_demo(OUTPUT_DIR, tone_color_converter, en_base_speaker_tts, zh_base_speaker_tts, en_source_default_se, en_source_style_se, zh_source_se)
demo.queue(max_size=2)

try:
    demo.launch(debug=False, height=1000)
except Exception:
    demo.launch(share=True, debug=False, height=1000)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-609/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/gradio/components/dropdown.py:86: UserWarning: The max_choices parameter is ignored when multiselect is False.
  warnings.warn(
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().

Cleanup

# import shutil
# shutil.rmtree(CKPT_BASE_PATH)
# shutil.rmtree(IRS_PATH)
# shutil.rmtree(OUTPUT_DIR)