RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Abstract

Recent advances in generative modeling have significantly improved the reconstruction of audio waveforms from various representations. Diffusion models are well suited to this task, but they suffer from high latency because they operate on individual sample points and require many sampling steps. In this study, we introduce RFWave, a multi-band Rectified Flow approach that reconstructs high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens. RFWave generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to improve efficiency. Because Rectified Flow targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave delivers high reconstruction quality together with high computational efficiency, generating audio up to 160 times faster than real time on a GPU.
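To illustrate why a straight transport trajectory permits so few sampling steps, the sketch below integrates a flow ODE dz/dt = v(z, t) from noise (t = 0) to data (t = 1) with a fixed-step Euler solver. It is a minimal toy under stated assumptions, not the RFWave implementation: `euler_sample` and `toy_velocity` are hypothetical names, the velocity field is an analytic stand-in for a trained network, and the small array only mimics a frames-by-bins layout.

```python
import numpy as np

def euler_sample(velocity, z0, num_steps=10):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = np.asarray(z0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity(z, t)
    return z

# Toy setup (assumption): a known target stands in for the data sample,
# shaped like a tiny (frames, bins) array.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))
noise = rng.standard_normal((4, 8))

def toy_velocity(z, t):
    # Analytic stand-in for a learned velocity network: it always points
    # straight at the target, so the path from noise to target is a line.
    return (target - z) / (1.0 - t)

sample = euler_sample(toy_velocity, noise, num_steps=10)
max_error = float(np.max(np.abs(sample - target)))
print(max_error)
```

Because the toy path is exactly straight, Euler integration lands on the target regardless of the step count. A learned velocity field is only approximately straight, so in practice the number of steps trades speed against reconstruction quality.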


(1) Audio Reconstruction from Mel-spectrogram

(1.1) Comparison with Diffusion Models


(1.1.1) LibriTTS

Ground Truth
PriorGrad
FreGrad
RFWave

(1.1.2) Opencpop

Ground Truth
PriorGrad
FreGrad
RFWave

(1.1.3) Jamendo

Ground Truth
PriorGrad
FreGrad
RFWave

(1.2) Comparison with GAN Models


(1.2.1) LibriTTS TestSet

Ground Truth
Vocos
BigVGAN
RFWave

(1.2.2) MUSDB18 TestSet

Ground Truth
Vocos
BigVGAN
RFWave

(2) Audio Reconstruction from Discrete Tokens


(2.1) Speech

(2.1.1) 1.5 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.1.2) 3.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.1.3) 6.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.1.4) 12.0 kbps

Ground Truth
EnCodec
RFWave

(2.2) Vocal

(2.2.1) 1.5 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.2.2) 3.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.2.3) 6.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.2.4) 12.0 kbps

Ground Truth
EnCodec
RFWave

(2.3) Sound Effect

(2.3.1) 1.5 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.3.2) 3.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.3.3) 6.0 kbps

Ground Truth
EnCodec
MBD
RFWave

(2.3.4) 12.0 kbps

Ground Truth
EnCodec
RFWave

(3) Supplemental Demos


(3.1) Stable Audio Demo

(3.2) Clean Speech Demo (Mel input, comparison with diffusion models)

(3.3) Clean Speech Demo (Mel input, comparison with GAN models)