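VALL-E: Neural Codec Language Model

Overview

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E exhibits emergent in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity.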

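Features

- VALL-E's pipeline is phoneme → discrete code → waveform.
- VALL-E generates discrete audio codec codes conditioned on phoneme and acoustic code prompts.
- VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT-3.
- VALL-E can synthesize personalized speech while preserving the acoustic environment of the speaker prompt.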
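To give a feel for the "discrete code" stage of the pipeline, the sketch below works out how many codec tokens a 3-second acoustic prompt turns into. The codec settings are assumptions taken from the VALL-E paper's EnCodec configuration (24 kHz audio, 75 frames per second, 8 codebooks of 1024 entries), not from this README, and the function names are hypothetical.

```python
# Assumed codec settings (not stated in this README): EnCodec at 24 kHz,
# 6 kbps, i.e. 75 frames per second and 8 residual codebooks of 1024 entries.
FRAME_RATE_HZ = 75
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024


def codec_token_shape(duration_s, frame_rate_hz=FRAME_RATE_HZ,
                      num_quantizers=NUM_QUANTIZERS):
    """Return (frames, quantizers) for audio of the given duration."""
    return round(duration_s * frame_rate_hz), num_quantizers


def bitrate_bps(frame_rate_hz=FRAME_RATE_HZ, num_quantizers=NUM_QUANTIZERS,
                codebook_size=CODEBOOK_SIZE):
    """Bits per second implied by the discrete code stream."""
    bits_per_code = (codebook_size - 1).bit_length()  # 10 bits for 1024 entries
    return frame_rate_hz * num_quantizers * bits_per_code


frames, n_q = codec_token_shape(3.0)
print(f"3 s prompt -> {frames} frames x {n_q} codebooks = {frames * n_q} codes")
print(f"implied bitrate: {bitrate_bps()} bit/s")
```

Under these assumptions the implied bitrate matches the 6 kbps codec setting, which is a useful consistency check on the frame rate and codebook count.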

[hidecontent type="logged" desc="隐藏内容：登录后可查看"]

An unofficial PyTorch implementation of VALL-E (Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers).

The VALL-E model can be trained on a single GPU.

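Demo

Broader Impacts

Because VALL-E can synthesize speech that preserves speaker identity, the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.

To avoid abuse, well-trained models and services will not be provided.

Install

To get up and running quickly, follow the steps below: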

# PyTorch
pip install torch==1.13.1 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install torchmetrics==0.11.1
# fbank
pip install librosa==0.8.1

# phonemizer pypinyin
apt-get install espeak-ng
## OSX: brew install espeak
pip install phonemizer==3.2.1 pypinyin==0.48.0

# lhotse update to newest version
# https://github.com/lhotse-speech/lhotse/pull/956
# https://github.com/lhotse-speech/lhotse/pull/960
pip uninstall lhotse
pip install git+https://github.com/lhotse-speech/lhotse

# k2
# find the right version in https://huggingface.co/csukuangfj/k2
pip install https://huggingface.co/csukuangfj/k2/resolve/main/cuda/k2-1.23.4.dev20230224+cuda11.6.torch1.13.1-cp310-cp310-linux_x86_64.whl

# icefall
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r requirements.txt
export PYTHONPATH=`pwd`/../icefall:$PYTHONPATH
echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.zshrc
echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.bashrc
cd -
source ~/.zshrc

# valle
git clone https://github.com/lifeiteng/valle.git
cd valle
pip install -e .
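After the steps above, checking that every package resolves can save a failed training launch. This is a minimal sketch using only the standard library; the module list mirrors the packages installed above, and the function name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names of the packages installed in the steps above.
REQUIRED_MODULES = [
    "torch", "torchaudio", "torchmetrics", "librosa",
    "phonemizer", "pypinyin", "lhotse", "k2", "icefall", "valle",
]


def missing_modules(names):
    """Return the top-level modules from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If `icefall` is reported as missing, the `PYTHONPATH` export from the icefall step has probably not been picked up by the current shell.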

Training & Inference

English example: examples/libritts/README.md
Chinese example: examples/aishell1/README.md
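Prefix modes for the NAR decoder: 0, 1, 2, 4
Paper Chapter 5.1: "The average length of the waveform in LibriLight is 60 seconds. During training, we randomly crop the waveform to a random length between 10 seconds and 20 seconds. For the NAR acoustic prompt tokens, we select a random segment waveform of 3 seconds from the same utterance."

- 0: no acoustic prompt tokens
- 1: random prefix of the current batched utterances (recommended)
- 2: random segment of the current batched utterances
- 4: same as the paper (because they randomly crop a long waveform into multiple utterances, "the same utterance" means the preceding or following utterance in the same long waveform)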
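The difference between the prefix modes is easiest to see on a toy sequence of codec frames. The sketch below is illustrative only and is not the repository's implementation: the function name, the cap on the prefix length in mode 1, and the `neighbor_codes` argument for mode 4 are all assumptions.

```python
import random


def split_prompt(codes, prefix_mode, prompt_len, rng, neighbor_codes=None):
    """Return (prompt_frames, target_frames) for one utterance.

    codes: list of codec frames for the utterance being trained on.
    neighbor_codes: frames of an adjacent utterance cut from the same
        long waveform (only used by mode 4).
    """
    if prefix_mode == 0:
        # No acoustic prompt at all.
        return [], list(codes)
    if prefix_mode == 1:
        # Prefix of random length; the rest of the utterance is the target.
        upper = max(1, min(prompt_len, len(codes) // 2))
        n = rng.randint(1, upper)
        return list(codes[:n]), list(codes[n:])
    if prefix_mode == 2:
        # Random contiguous segment; the whole utterance stays the target.
        n = min(prompt_len, len(codes))
        start = rng.randint(0, len(codes) - n)
        return list(codes[start:start + n]), list(codes)
    if prefix_mode == 4:
        # Prompt taken from a neighboring utterance of the same recording.
        if neighbor_codes is None:
            raise ValueError("prefix_mode 4 needs neighbor_codes")
        return list(neighbor_codes[:prompt_len]), list(codes)
    raise ValueError(f"unsupported prefix_mode: {prefix_mode}")


rng = random.Random(0)
utterance = list(range(100))          # stand-in for 100 codec frames
neighbor = list(range(1000, 1050))    # stand-in for an adjacent utterance
for mode in (0, 1, 2, 4):
    prompt, target = split_prompt(utterance, mode, 30, rng, neighbor)
    print(mode, len(prompt), len(target))
```

Only mode 4 needs frames from a second utterance in this sketch, which fits the README's use of a separate input strategy (`PromptedPrecomputedFeatures`) for that mode.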

# To train the NAR decoders with prefix mode 4
python3 bin/trainer.py --prefix-mode 4 --dataset libritts --input-strategy PromptedPrecomputedFeatures ...

LibriTTS demo: trained on one GPU with 24 GB of memory

cd examples/libritts

# step1 prepare dataset
bash prepare.sh --stage -1 --stop-stage 3

# step2 train the model on one GPU with 24GB memory
exp_dir=exp/valle

## Train AR model
python3 bin/trainer.py --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
      --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir}

## Train NAR model
cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1
python3 bin/trainer.py --max-duration 40 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
      --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir}

# step3 inference
python3 bin/infer.py --output-dir infer/demos \
    --model-name valle --norm-first true --add-prenet false \
    --share-embedding true \
    --text-prompts "KNOT one point one five miles per hour." \
    --audio-prompts ./prompts/8463_294825_000043_000000.wav \
    --text "To get up and running quickly just follow the steps below." \
    --checkpoint=${exp_dir}/best-valid-loss.pt
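Two details of the training commands above are easy to miss: why the best AR checkpoint is copied to `epoch-2.pt` before NAR training, and how much audio goes into each optimizer update. The sketch below assumes the trainer follows the icefall-style convention of resuming from `epoch-(start_epoch - 1).pt`, which the inline comment `--start-epoch 3=2+1` suggests, and that `--max-duration` is the total audio duration in seconds per batch. Both function names are hypothetical.

```python
def resume_checkpoint_name(start_epoch):
    """Checkpoint file the trainer is assumed to load for a given --start-epoch."""
    if start_epoch <= 1:
        return None  # training starts from scratch
    return f"epoch-{start_epoch - 1}.pt"


def audio_seconds_per_update(max_duration, accumulate_grad_steps):
    """Seconds of audio contributing to one optimizer step on a single GPU."""
    return max_duration * accumulate_grad_steps


# NAR stage: --start-epoch 3, so the copied AR checkpoint must be epoch-2.pt.
print(resume_checkpoint_name(3))
# AR stage: --max-duration 80 with --accumulate-grad-steps 4.
print(audio_seconds_per_update(80, 4))
# NAR stage: --max-duration 40 with --accumulate-grad-steps 4.
print(audio_seconds_per_update(40, 4))
```

Under these assumptions the NAR stage sees half as much audio per update as the AR stage, because its `--max-duration` is halved while the accumulation factor stays at 4.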

# Demo Inference
https://github.com/lifeiteng/lifeiteng.github.com/blob/main/valle/run.sh#L68

Troubleshooting

SummaryWriter segmentation fault (core dumped)

Line: tb_writer = SummaryWriter(log_dir=f"{params.exp_dir}/tensorboard")
Fix: https://github.com/tensorflow/tensorboard/pull/6135/files

file=`python  -c 'import site; print(f"{site.getsitepackages()[0]}/tensorboard/summary/writer/event_file_writer.py")'`
sed -i 's/import tf/import tensorflow_stub as tf/g' $file
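The `sed -i` form above is GNU syntax; BSD sed on macOS requires a backup suffix argument after `-i`. A portable alternative is to apply the same substitution from Python. This is a minimal sketch, assuming the target file contains an import of the form `from tensorboard.compat import tf`; the function names are hypothetical.

```python
from pathlib import Path

OLD = "import tf"
NEW = "import tensorflow_stub as tf"


def patch_source(source):
    """Apply the same substitution as the sed command to a string."""
    return source.replace(OLD, NEW)


def patch_file(path):
    """Patch a file in place, keeping a .bak copy; return True if it changed."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    patched = patch_source(original)
    if patched == original:
        return False  # already patched, or nothing to replace
    Path(str(path) + ".bak").write_text(original, encoding="utf-8")
    path.write_text(patched, encoding="utf-8")
    return True
```

To use it, pass the same path the shell snippet computes, for example `patch_file(f"{site.getsitepackages()[0]}/tensorboard/summary/writer/event_file_writer.py")` after `import site`. Running it a second time leaves the file unchanged.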

Training on a custom dataset?

- Prepare the dataset as lhotse manifests
  - There are plenty of references in lhotse/recipes
- python3 bin/tokenizer.py ...
- python3 bin/trainer.py ...
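As a starting point for the manifest step, the sketch below builds a lhotse recording manifest and supervision manifest from a flat directory. The directory layout (each `*.wav` next to a same-named `*.txt` transcript), the output file names, and the default speaker and language values are all assumptions for illustration. Check the example recipes for what `bin/tokenizer.py` actually expects.

```python
from pathlib import Path


def collect_utterances(corpus_dir):
    """Pair each *.wav with the transcript in the same-named *.txt file."""
    pairs = []
    for wav in sorted(Path(corpus_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            pairs.append((wav, txt.read_text(encoding="utf-8").strip()))
    return pairs


def build_manifests(corpus_dir, output_dir, speaker="spk0", language="English"):
    """Write recording and supervision manifests for a single-speaker corpus."""
    # Imported lazily so collect_utterances() works without lhotse installed.
    from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

    recordings, supervisions = [], []
    for wav, text in collect_utterances(corpus_dir):
        recording = Recording.from_file(wav)  # id defaults to the file stem
        recordings.append(recording)
        supervisions.append(SupervisionSegment(
            id=recording.id, recording_id=recording.id,
            start=0.0, duration=recording.duration, channel=0,
            text=text, language=language, speaker=speaker,
        ))

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; adapt them to the recipe you copy from.
    RecordingSet.from_recordings(recordings).to_file(out / "custom_recordings_all.jsonl.gz")
    SupervisionSet.from_segments(supervisions).to_file(out / "custom_supervisions_all.jsonl.gz")
```

Audio files without a matching transcript are skipped rather than given an empty text, so a missing `.txt` shows up as a smaller manifest instead of silent training data.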

Citing

To cite this repository:

@misc{valle,
  author={Feiteng Li},
  title={VALL-E: A neural codec language model},
  year={2023},
  url={http://github.com/lifeiteng/vall-e}
}

@article{VALL-E,
  title     = {Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author    = {Chengyi Wang and Sanyuan Chen and Yu Wu and
               Ziqiang Zhang and Long Zhou and Shujie Liu and
               Zhuo Chen and Yanqing Liu and Huaming Wang and
               Jinyu Li and Lei He and Sheng Zhao and Furu Wei},
  year      = {2023},
  eprint    = {2301.02111},
  archivePrefix = {arXiv},
  volume    = {abs/2301.02111},
  url       = {http://arxiv.org/abs/2301.02111},
}

[/hidecontent]
