This repository implements the singing voice conversion method described in *PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network*, along with multiple improvements to its conversion quality, using PyTorch. Detailed surveys and experiments have been published as a master's thesis, which you can get here.
You can find demo audio files and comparisons to the original PitchNet on our demo website.
We use the NUS-48E dataset throughout the whole project. You can download it and perform data preprocessing and augmentation as described below.
Create a conda environment using `environment.yml`:

```
conda env create -f environment.yml
```
Notice: Make sure you are under the project root when executing these scripts!
This script will read through the given `$raw_dir` and generate folders with the same structure under `$output_dir`, containing augmented audio files next to the original ones.

```
python data_augmentation.py $raw_dir $output_dir --aug-type $aug_type
```
- `raw_dir`: Path to the raw data directory with the following structure:
```
$raw_dir/
├── ADIZ
│   ├── 01.wav
│   ├── 09.wav
│   ├── 13.wav
│   └── 18.wav
├── JLEE
│   ├── 05.wav
│   ├── 08.wav
│   ├── 11.wav
│   └── 15.wav
...
```
- `output_dir`: Path to the directory to save the augmented and original files. The resulting structure will look like this:
```
$output_dir/
├── ADIZ
│   ├── 01_original.wav
│   ├── 01_aug_back.wav
│   ├── 01_aug_phase.wav
│   ├── 01_aug_back_phase.wav
│   ├── 09_original.wav
│   ├── 09_aug_back.wav
│   ├── 09_aug_phase.wav
│   ├── 09_aug_back_phase.wav
...
```
- `aug_type`: Type of augmentation
This script will read through the given `$raw_dir` and generate folders with the same structure under `$output_dir`, with each audio file processed into a `*.h5` data file ready to be read by the dataset classes.

```
python data_preprocess.py $raw_dir $output_dir --model $model
```
- `raw_dir`: Path to the raw data directory
- `output_dir`: Path to the directory to save the processed files
- `model`: Target model type to preprocess the data for
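The generated `*.h5` files can be inspected before training. Below is a minimal sketch using `h5py`; it makes no assumption about the key names, since those are whatever `data_preprocess.py` actually writes:

```python
import h5py

def inspect_h5(path):
    """Print the name, shape, and dtype of every dataset in an .h5 file."""
    with h5py.File(path, "r") as f:
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(show)
```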
This script will train the model. If `--model-path` is given, training will continue from that checkpoint. To see other training parameters, run the script with `-h`.

```
python train.py $train_data_dir $model_dir --model $model --model-path $model_path
```
- `train_data_dir`: Path to the processed data directory
- `model_dir`: Directory to save checkpoint models
- `model`: Target model type
- `model_path`: Path to a pretrained model
You can get our pretrained proposed model here.
This script will perform singing voice conversion on the given audio file. For two-phase conversion, intermediate files will be saved to the `.tmp/` directory.

```
python inference.py $src_file $target_dir $singer_id $model_path --pitch-shift $pitch_shift --two-phase --train-data-dir $train_data_dir
```
- `src_file`: Path to the source audio file
- `target_dir`: Path to save the converted audio file
- `singer_id`: Target singer ID (name)
- `model_path`: Model path
- `pitch_shift`: Factor of pitch shifting performed on conversion, or "auto" for automatic pitch range shifting
- `two_phase`: Whether or not to perform two-phase conversion
- `train_data_dir`: The original training data used for two-phase conversion
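With `--pitch-shift auto`, the source pitch range is shifted toward the target singer's. One common heuristic for such a factor, shown here purely as an illustration (the script's exact rule may differ), is the ratio of median voiced F0 values:

```python
import numpy as np

def auto_pitch_shift_factor(src_f0, tgt_f0):
    """Ratio of target to source median F0, ignoring unvoiced frames (f0 == 0)."""
    src = np.asarray(src_f0, dtype=float)
    tgt = np.asarray(tgt_f0, dtype=float)
    return float(np.median(tgt[tgt > 0]) / np.median(src[src > 0]))

# e.g. a source-target pair roughly an octave apart yields a factor near 2
```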
This script will plot the training loss curves of a given checkpoint. The output image will be stored in `plotting-scripts/plotting-results/`.

```
python plotting-scripts/plot_loss.py $checkpoint_path --window-size $window_size --loss-types $loss_types
```
- `checkpoint_path`: Path to the target training checkpoint
- `window_size`: Window size for the moving average
- `loss_types`: Target loss types, separated by spaces
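The `--window-size` smoothing is a moving average over the logged loss values. A sketch of that operation with NumPy (the script's own smoothing code may differ in edge handling):

```python
import numpy as np

def moving_average(values, window_size):
    """Smooth a 1-D sequence with an unweighted moving average."""
    kernel = np.ones(window_size) / window_size
    # "valid" drops the partially covered edges of the sequence
    return np.convolve(values, kernel, mode="valid")
```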
This script will plot the pitch extracted from the given audio file.
```
python plotting-scripts/plot_pitch.py $src_file
```
- `src_file`: Path to the source audio file
This script will plot the audio duration histogram of the given dataset.
```
python plotting-scripts/plot_hist.py $raw_dir
```
- `raw_dir`: Path to the raw data directory
This script will plot the pitch histogram of the given dataset.
```
python plotting-scripts/plot_pitch_hist.py $raw_dir
```
- `raw_dir`: Path to the raw data directory
This script will plot the spectrogram of the given audio file.
```
python plotting-scripts/plot_spec.py $src_file
```
- `src_file`: Path to the source audio file
This script will conduct simple unit tests and print out a model summary (if applicable). Run with the `-h` option to see all available networks.

```
python test_network.py $target_net
```
This script will select a random N-second segment from each raw audio file in the given data directory and output them as a mini dataset.

```
python evaluation/select_data.py $raw_dir $output_dir --seg-len $seg_len
```
- `raw_dir`: Path to the raw data directory
- `output_dir`: Path to the directory to save the processed files
- `seg_len`: Length (in seconds) of each segment
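Picking a random N-second segment amounts to choosing a random window of `seg_len * sample_rate` samples from each waveform. A minimal sketch with NumPy (the function name and the handling of short files are illustrative, not the script's exact behavior):

```python
import numpy as np

def random_segment(audio, sample_rate, seg_len, rng=None):
    """Return a random seg_len-second slice; shorter files are kept whole."""
    if rng is None:
        rng = np.random.default_rng()
    n = int(seg_len * sample_rate)
    if len(audio) <= n:
        return audio
    start = int(rng.integers(0, len(audio) - n + 1))
    return audio[start:start + n]
```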
This script will perform evaluation given the evaluation data directory, the output file directory, and the target model.

```
python evaluation/evaluate.py $raw_dir $output_dir $model_path $sc_model_path $mapping --pitch-shift --two-phase --train-data-dir
```
- `raw_dir`: Path to the evaluation data directory
- `output_dir`: Path to the directory to save converted audio files
- `model_path`: Path to the target model to evaluate
- `sc_model_path`: Path to the singer classifier model
- `mapping`: The mapping config of the conversion pairs
- `pitch_shift`: Whether or not to perform pitch shifting
- `two_phase`: Whether or not to perform two-phase conversion
- `train_data_dir`: The original training data used for two-phase conversion
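A typical objective metric with a singer classifier is the fraction of converted clips the classifier assigns to the intended target singer. The evaluation script may report more than this, but the core computation can be sketched as:

```python
import numpy as np

def singer_accuracy(predicted_ids, target_ids):
    """Fraction of converted clips classified as their target singer."""
    predicted = np.asarray(predicted_ids)
    target = np.asarray(target_ids)
    return float(np.mean(predicted == target))
```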
You can get the singer classifier model we used in the evaluation here.
Below is the hardware used in these experiments and the corresponding training & inference times, for those interested in trying out the project. For more detailed analysis and experiment results, please refer to the thesis.
| Part | Specification |
|---|---|
| CPU | Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz |
| RAM | 125GB |
| GPU | TITAN RTX x2 |
| Disk | PLEXTOR PX-512M9PeGN |
A complete training run (300,000 steps) takes around 40 hours.
Converting one second of audio takes around 3 minutes.
- This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
- We referenced facebookresearch/music-translation, which has the same license, for the WaveNet implementation and made modifications to fit our usage.
- pytorch-summary is used in this repo and is licensed under an MIT License.
```
@article{songrong2021svc,
  title     = {Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach},
  author    = {Lee, Songrong},
  journal   = {Graduate Institute of Networking and Multimedia, National Taiwan University Master Thesis},
  pages     = {1--56},
  year      = {2021},
  publisher = {National Taiwan University}
}
```