AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
Abstract
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings, while facing a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment. We 1) adopt a novel rhythm adaptor to predict the target rhythm representation, bridging the modality gap between content and pitch, where the rhythm representation is disentangled in a simple yet effective way and quantized into a discrete space; and 2) leverage a cross-modal aligner to explicitly re-align the content features according to the predicted rhythm and conduct cross-modal fusion for re-synthesis. Experimental results show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
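As a rough illustration of the discrete rhythm space mentioned in the abstract, quantizing a continuous rhythm representation can be sketched as a nearest-neighbor codebook lookup (standard vector quantization). This is a minimal sketch, not the released AlignSTS code; the codebook size and feature dimension are illustrative:

```python
import numpy as np

def quantize_rhythm(rhythm, codebook):
    """Map each frame of a (T, D) rhythm sequence to its nearest (K, D) codebook entry."""
    # Squared L2 distance between every frame and every code: shape (T, K)
    dists = ((rhythm[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)      # discrete rhythm tokens, shape (T,)
    return indices, codebook[indices]   # tokens and their quantized embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 16))   # 32 discrete rhythm codes (illustrative)
rhythm = rng.normal(size=(100, 16))    # 100 frames of continuous rhythm features
tokens, quantized = quantize_rhythm(rhythm, codebook)
```

In the full model, the codebook would be learned jointly with the network (e.g. with a VQ-style commitment loss) rather than fixed as here.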
Code
The PyTorch implementation of AlignSTS can be found here.
Main Results
- GT: ground truth singing samples.
- GT (HiFiGAN): we first convert the reference audio into mel-spectrograms and then convert them back to audio using HiFi-GAN.
- Speech: the corresponding source speech samples.
- SpeechSplit 2.0: the results generated by SpeechSplit 2.0.
- AlignSTS (Zero-Shot): the results generated by AlignSTS in a zero-shot scenario, where the model is trained only on singing data in a self-supervised manner and tested on unseen speech data.
- AlignSTS: the results generated by AlignSTS.
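The GT (HiFiGAN) condition above measures the vocoder's upper bound: audio is converted to a mel-spectrogram and then vocoded back. As a self-contained sketch of the first half of that round trip, here is a plain-NumPy mel-spectrogram extraction (all hyperparameters such as FFT size, hop, and mel-bin count are illustrative, not necessarily those used in the paper):

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the waveform with a Hann window and take the magnitude STFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))        # (T, n_fft//2 + 1)
    # Build a triangular mel filterbank over the FFT bins.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(spec @ fb.T + 1e-5)                 # log-mel, shape (T, n_mels)

wav = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s of 440 Hz
mel = mel_spectrogram(wav)
```

The second half (mel back to audio) is what HiFi-GAN performs with a trained neural vocoder and is not reproducible in a few lines.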
Audio samples (GT / GT (HiFiGAN) / Speech / SpeechSplit 2.0 / AlignSTS (Zero-Shot) / AlignSTS) are provided for each of the following lyric excerpts:
- but I do do feel that
- it’s so very cold outside
- in a big big world
- I can see the first leaf falling
- outside it’s now raining
- but you never give
- what you don’t understand
- take a bullet straight through my brain
- that’s just what you are yeah
- yes i would die for ya baby
Ablation Study
- AlignSTS: basic setting.
- w/o RA: without the rhythm adaptor, where the rhythm representation is instead obtained by interpolating and stretching the rhythm of the input speech samples.
- w/o CM: without cross-modal alignment, where we simply interpolate and stretch the content features to the same length as the F0 contour.
- w/o F0: we cut off the skip connection of the pitch representation in the cross-modal fusion.
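The "w/o CM" baseline above replaces learned alignment with naive time stretching. A minimal sketch of that operation, assuming frame-level content features and an F0 contour (names and shapes are illustrative, not from the released code):

```python
import numpy as np

def stretch_to_length(content, target_len):
    """Resample (T_src, D) features to (target_len, D) by linear interpolation in time."""
    t_src = np.linspace(0.0, 1.0, content.shape[0])
    t_tgt = np.linspace(0.0, 1.0, target_len)
    # Interpolate each feature dimension independently along the time axis.
    return np.stack([np.interp(t_tgt, t_src, content[:, d])
                     for d in range(content.shape[1])], axis=1)

content = np.random.randn(50, 256)               # e.g. 50 frames of content features
f0 = np.random.rand(120)                         # target F0 contour with 120 frames
stretched = stretch_to_length(content, len(f0))  # now time-aligned to F0: (120, 256)
```

This uniform stretch ignores where syllables actually fall in the target rhythm, which is exactly the gap the cross-modal aligner is meant to close.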
Audio samples (AlignSTS / w/o RA / w/o CM / w/o F0) are provided for each of the following lyric excerpts:
- but I do do feel that
- it’s so very cold outside
- in a big big world
- I can see the first leaf falling
- outside it’s now raining