AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

Abstract

The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings, but it faces a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment. We 1) adopt a novel rhythm adaptor to predict the target rhythm representation, which bridges the modality gap between content and pitch; the rhythm representation is disentangled in a simple yet effective way and quantized into a discrete space; and 2) leverage a cross-modal aligner to explicitly re-align the content features according to the predicted rhythm and perform cross-modal fusion for re-synthesis. Experimental results show that AlignSTS achieves superior performance on both objective and subjective metrics.
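To make the two steps in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the AlignSTS implementation. It assumes that quantization means a nearest-neighbour lookup in a codebook and that re-alignment can be modelled as scaled dot-product cross-attention, with rhythm embeddings as queries and content frames as keys and values. The abstract specifies neither mechanism, and all names (`quantize`, `cross_attend`, `codebook`) and all values are hypothetical.

```python
import math

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector
    (squared Euclidean distance). Assumed mechanism, for illustration only."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in frames]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: returns one output row per query,
    each a weighted average of the value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

# Toy data (hypothetical): a 3-entry codebook, 4 target rhythm frames,
# and 2 source content frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rhythm_frames = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.1], [1.1, -0.1]]
content = [[1.0, 0.0], [0.0, 1.0]]

# Step 1: discretize the rhythm representation.
rhythm_ids = quantize(rhythm_frames, codebook)
rhythm_emb = [codebook[i] for i in rhythm_ids]

# Step 2: re-align content to the target length, one row per rhythm frame.
realigned = cross_attend(rhythm_emb, content, content)
print(rhythm_ids, len(realigned))
```

The point of the sketch is the shape change: the content sequence has 2 frames, while the re-aligned output has 4, matching the target rhythm sequence. This is the sense in which content is re-aligned "according to the predicted rhythm" here.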

Code

A PyTorch implementation of AlignSTS can be found here.

Main Results

Audio samples (wav) for each utterance are provided for six systems: GT, GT (HiFiGAN), Speech, SpeechSplit 2.0, AlignSTS (Zero-Shot), and AlignSTS.

  1. but I do do feel that
  2. it’s so very cold outside
  3. in a big big world
  4. I can see the first leaf falling
  5. outside it’s now raining
  6. but you never give
  7. what you don’t understand
  8. take a bullet straight through my brain
  9. that’s just what you are yeah
  10. yes i would die for ya baby

Ablation Study

Audio samples (wav) for each utterance are provided for four systems: AlignSTS, w/o RA, w/o CM, and w/o F0.

  1. but I do do feel that
  2. it’s so very cold outside
  3. in a big big world
  4. I can see the first leaf falling
  5. outside it’s now raining