RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Abstract

Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues because they operate at the level of individual sample points and require numerous sampling steps. In this study, we introduce RFWave, a multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete tokens. RFWave generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers superior computational efficiency, enabling audio generation at speeds up to 97 times faster than real-time on a GPU.
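To make the "10 sampling steps" concrete, the following is a minimal sketch of fixed-step Euler sampling for a flow model, which integrates dx/dt = v(x, t) from noise at t = 0 to data at t = 1. It is an illustration under stated assumptions, not the RFWave implementation: `euler_sample`, `toy_velocity`, and `target` are hypothetical names, and the toy velocity field stands in for a trained network that would predict velocities for all subbands of a complex spectrogram.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                      # t takes values 0, dt, ..., 1 - dt
        x = x + dt * velocity_fn(x, t)  # one Euler update
    return x

# Hypothetical stand-in for a trained model: the velocity of a perfectly
# straight path from the current point to a fixed target.
target = np.array([0.5, -1.0, 2.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)

sample_10 = euler_sample(toy_velocity, noise, num_steps=10)
sample_1 = euler_sample(toy_velocity, noise, num_steps=1)
```

Because the toy trajectory is exactly straight, the velocity is constant along the path and Euler integration reaches the target with any number of steps, even one. A learned velocity field is only approximately straight, which is why a small but non-trivial step count is still needed in practice.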

Configuration Definitions

+CFG2 : Classifier-free guidance with a guidance coefficient of 2.0
+STFT : STFT loss applied during training
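As a rough illustration of the +CFG2 setting, classifier-free guidance is commonly applied by extrapolating from an unconditional prediction toward a conditional one. The sketch below assumes that common form with a scale of 2.0; the exact convention used by RFWave is not stated on this page, and `guided_velocity`, `v_cond`, and `v_uncond` are hypothetical names.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale=2.0):
    """Combine conditional and unconditional predictions.

    Assumed convention: scale=0 gives the unconditional prediction,
    scale=1 the conditional one, and scale>1 extrapolates past it.
    """
    v_cond = np.asarray(v_cond, dtype=np.float64)
    v_uncond = np.asarray(v_uncond, dtype=np.float64)
    return v_uncond + scale * (v_cond - v_uncond)

# Toy predictions standing in for two forward passes of a model.
v_c = np.array([1.0, 2.0])
v_u = np.array([0.0, 1.0])
v_guided = guided_velocity(v_c, v_u, scale=2.0)
```

Some papers instead write the same idea as (1 + w) * v_cond - w * v_uncond, so a stated coefficient should be read together with the convention that defines it.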


(1) Audio Reconstruction from Mel-spectrogram


(1.1) LibriTTS (more)

Ground Truth
Vocos
PriorGrad
FreGrad
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(1.2) Opencpop (more)

Ground Truth
Vocos
PriorGrad
FreGrad
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(1.3) Jamendo (more)

Ground Truth
Vocos
PriorGrad
FreGrad
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2) Audio Reconstruction from Discrete Tokens


(2.1) Speech

(2.1.1) 1.5 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.1.2) 3.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.1.3) 6.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.1.4) 12.0 kbps (more)

Ground Truth
EnCodec
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.2) Vocal

(2.2.1) 1.5 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.2.2) 3.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.2.3) 6.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.2.4) 12.0 kbps (more)

Ground Truth
EnCodec
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.3) Sound Effect

(2.3.1) 1.5 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.3.2) 3.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.3.3) 6.0 kbps (more)

Ground Truth
EnCodec
MBD
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(2.3.4) 12.0 kbps (more)

Ground Truth
EnCodec
RFWave
RFWave + STFT
RFWave + CFG2
RFWave + CFG2 + STFT

(3) Examples of audio generation integrated with other models

(3.1) Example With Bark (more)

EnCodec
MBD
RFWave

011 prompt: He dashed back across the road, hurried up to his office, snapped at his secretary not to disturb him, seized his telephone, and had almost finished dialing his home number when he changed his mind.

012 prompt: He'd forgotten all about the people in cloaks until he passed a group of them next to the baker's.

023 prompt: ♪♪ [piano] ♪♪

(3.2) Example With ChatTTS (more)

ChatTTS + Vocos
ChatTTS + RFWave

(6) prompt: He dashed back across the road, hurried up to his office, snapped at his secretary not to disturb him, seized his telephone, and had almost finished dialing his home number when he changed his mind.



(7) prompt: He'd forgotten all about the people in cloaks until he passed a group of them next to the baker's.



(8) prompt: "Sorry," he grunted, as the tiny old man stumbled and almost fell.


(4) Supplementary Vocoder Demo (with BigVGAN)

(4.1) Generalization of vocoders to the out-of-domain Opencpop dataset; all models trained on LibriTTS (more)

BigVGAN
RFWave
BigVGAN Spectrogram
RFWave Spectrogram

(4.2) Demo on MusDB (models trained on LibriTTS or a universal dataset; audio synthesized on the MusDB dataset)

(4.2.1) MusDB vocal (more audio examples, spectrogram example pictures)

Ground Truth
Vocos
PriorGrad
FreGrad
BigVGAN-base
BigVGAN
RFWave
BigVGAN-v2
RFWave-Universal

(4.2.2) MusDB drums (more)

Ground Truth
Vocos
PriorGrad
FreGrad
BigVGAN-base
BigVGAN
RFWave
BigVGAN-v2
RFWave-Universal

(4.2.3) MusDB bass (more)

Ground Truth
Vocos
PriorGrad
FreGrad
BigVGAN-base
BigVGAN
RFWave
BigVGAN-v2
RFWave-Universal

(4.2.4) MusDB other (more)

Ground Truth
Vocos
PriorGrad
FreGrad
BigVGAN-base
BigVGAN
RFWave
BigVGAN-v2
RFWave-Universal

(4.2.5) MusDB mixture (more)

Ground Truth
Vocos
PriorGrad
FreGrad
BigVGAN-base
BigVGAN
RFWave
BigVGAN-v2
RFWave-Universal