VocV1 (VogenVoc): Audio Samples

Demo page for the paper “A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules”

Abstract

The quality of the raw audio waveform generated by a vocoder affects a wide range of audio generative tasks. In recent years, the dominance of source-filter vocoders has been strongly challenged by neural vocoders, which deliver far superior synthesized audio quality. However, neural vocoders have introduced limitations of their own, including low runtime efficiency and unstable pitch, especially in models without an explicit periodic excitation input; neither has been a problem for source-filter vocoders. In this paper we present a novel approach that takes the best of both. We start with an in-depth examination of every building block in WORLD, one of the best-performing source-filter vocoders based on plain signal processing algorithms, look for the blocks that do not work well, and replace them with small, lightweight, task-specific neural network models. We also rearrange the vocoding pipeline so that the building blocks work together more smoothly. Our objective and subjective evaluations show that our method achieves competitive synthesized audio quality, even compared with neural vocoders, at a much lower computational cost, while retaining the spectral envelope acoustic feature and the high pitch accuracy of conventional source-filter vocoders.

Figure showing the synthesis pipeline used in this work. Red-orange means complex-valued; blue-gray means real-valued.

Figure showing a runtime performance comparison among vocoders under various hardware settings. Numbers are the time in milliseconds needed to synthesize 1 second of audio. All tests were run on a single CPU thread.
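
As a rough illustration of the metric in the runtime figure, the sketch below times a synthesis call and normalizes the result to milliseconds per second of generated audio. It is not the benchmark code used in the paper: `ms_per_second_of_audio`, `dummy_vocoder`, the 44100 Hz sample rate, and the best-of-five timing are all assumptions made for this example, and the dummy function merely stands in for a real vocoder.

```python
import time


def ms_per_second_of_audio(synthesize, n_samples, sample_rate, repeats=5):
    """Best-of-`repeats` wall-clock time, in ms, per second of synthesized audio.

    `synthesize` is any callable taking the number of samples to generate
    (a hypothetical interface chosen for this sketch).
    """
    audio_seconds = n_samples / sample_rate
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(n_samples)
        best = min(best, time.perf_counter() - start)
    # Convert seconds to milliseconds, then normalize by the audio duration.
    return best * 1000.0 / audio_seconds


def dummy_vocoder(n_samples):
    # Placeholder that only allocates silence; a real vocoder would go here.
    return [0.0] * n_samples


cost_ms = ms_per_second_of_audio(dummy_vocoder, n_samples=44100, sample_rate=44100)
# Real-time factor: values below 1.0 mean faster than real time.
rtf = cost_ms / 1000.0
print(f"{cost_ms:.3f} ms per second of audio (RTF {rtf:.6f})")
```

When timing a real model this way, the numerical backend would also need to be restricted to one thread to match the single-thread setting described above, for example with `torch.set_num_threads(1)` in PyTorch.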

HiFi-GAN

UnivNet

Use headphones for the best experience.