AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
Abstract
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings, while facing a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment. We 1) adopt a novel rhythm adaptor to predict the target rhythm representation, bridging the modality gap between content and pitch, where the rhythm representation is disentangled in a simple yet effective way and quantized into a discrete space; and 2) leverage a cross-modal aligner to explicitly re-align the content features according to the predicted rhythm and conduct cross-modal fusion for re-synthesis. Experimental results show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
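As a rough illustration of the discrete rhythm space mentioned in the abstract, quantizing a continuous rhythm representation can be sketched as a nearest-neighbor codebook lookup (standard vector quantization). This is a minimal sketch, not the released AlignSTS code; the codebook size and feature dimension are illustrative:

```python
import numpy as np

def quantize_rhythm(rhythm, codebook):
    """Map each frame of a (T, D) rhythm sequence to its nearest (K, D) codebook entry."""
    # Squared L2 distance between every frame and every code: shape (T, K)
    dists = ((rhythm[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)      # discrete rhythm tokens, shape (T,)
    return indices, codebook[indices]   # tokens and their quantized embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 16))   # 32 discrete rhythm codes (illustrative)
rhythm = rng.normal(size=(100, 16))    # 100 frames of continuous rhythm features
tokens, quantized = quantize_rhythm(rhythm, codebook)
```

In the full model, the codebook would be learned jointly with the network (e.g. with a VQ-style commitment loss) rather than fixed as here.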
Code
The PyTorch implementation of AlignSTS can be found here.
Main Results
- GT: ground truth singing samples.
- GT (HiFiGAN): we first convert the reference audio into mel-spectrograms and then convert them back to audio using HiFi-GAN.
- Speech: the corresponding source speech samples.
- SpeechSplit 2.0: the results generated by SpeechSplit 2.0.
- AlignSTS (Zero-Shot): the results generated by AlignSTS in a zero-shot scenario, where the model is trained only on singing data in a self-supervised manner and tested on unseen speech data.
- AlignSTS: the results generated by AlignSTS.
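The GT (HiFiGAN) condition above measures the vocoder's upper bound: audio is converted to a mel-spectrogram and then vocoded back. As a self-contained sketch of the first half of that round trip, here is a plain-NumPy mel-spectrogram extraction (all hyperparameters such as FFT size, hop, and mel-bin count are illustrative, not necessarily those used in the paper):

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the waveform with a Hann window and take the magnitude STFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))        # (T, n_fft//2 + 1)
    # Build a triangular mel filterbank over the FFT bins.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(spec @ fb.T + 1e-5)                 # log-mel, shape (T, n_mels)

wav = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s of 440 Hz
mel = mel_spectrogram(wav)
```

The second half (mel back to audio) is what HiFi-GAN performs with a trained neural vocoder and is not reproducible in a few lines.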
Audio samples (GT / GT (HiFiGAN) / Speech / SpeechSplit 2.0 / AlignSTS (Zero-Shot) / AlignSTS) are provided for each of the following lyric excerpts:
- but I do do feel that
- it’s so very cold outside
- in a big big world
- I can see the first leaf falling
- outside it’s now raining
- but you never give
- what you don’t understand
- take a bullet straight through my brain
- that’s just what you are yeah
- yes i would die for ya baby
Ablation Study
- AlignSTS: basic setting.
- w/o RA: without the rhythm adaptor, where the rhythm representation is instead obtained by interpolating and stretching the rhythm of the input speech samples.
- w/o CM: without cross-modal alignment, where we simply interpolate and stretch the content features to the same length as the F0 contour.
- w/o F0: we cut off the skip connection of the pitch representation in the cross-modal fusion.
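The "w/o CM" baseline above replaces learned alignment with naive time stretching. A minimal sketch of that operation, assuming frame-level content features and an F0 contour (names and shapes are illustrative, not from the released code):

```python
import numpy as np

def stretch_to_length(content, target_len):
    """Resample (T_src, D) features to (target_len, D) by linear interpolation in time."""
    t_src = np.linspace(0.0, 1.0, content.shape[0])
    t_tgt = np.linspace(0.0, 1.0, target_len)
    # Interpolate each feature dimension independently along the time axis.
    return np.stack([np.interp(t_tgt, t_src, content[:, d])
                     for d in range(content.shape[1])], axis=1)

content = np.random.randn(50, 256)               # e.g. 50 frames of content features
f0 = np.random.rand(120)                         # target F0 contour with 120 frames
stretched = stretch_to_length(content, len(f0))  # now time-aligned to F0: (120, 256)
```

This uniform stretch ignores where syllables actually fall in the target rhythm, which is exactly the gap the cross-modal aligner is meant to close.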
Audio samples (AlignSTS / w/o RA / w/o CM / w/o F0) are provided for each of the following lyric excerpts:
- but I do do feel that
- it’s so very cold outside
- in a big big world
- I can see the first leaf falling
- outside it’s now raining