When listening to spoken content, it is often desirable to vary the speech rate while preserving the speaker’s timbre and pitch.
To date, advanced signal processing techniques have been used to address this task,
but maintaining high speech quality at all time-scales remains a challenge.
Inspired by the success of speech generation using Generative Adversarial Networks (GANs),
we propose a novel unsupervised learning algorithm for time-scale modification (TSM) of speech, called ScalerGAN.
The model is trained on a set of speech utterances for which no time-scales are provided.
The ScalerGAN algorithm is composed of a generator that takes speech and a desired rate as input and outputs time-adjusted speech;
a discriminator that operates on various spectrum scales;
and a decoder that converts the time-adjusted signal back to the original rate to maintain consistency.
Using an A/B test and a conditional A/B test, human listeners were asked to compare ScalerGAN with other state-of-the-art TSM methods.
The results showed that ScalerGAN achieves higher speech quality than all other methods.
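The generator/decoder consistency idea described above can be illustrated with a minimal sketch. This is not the ScalerGAN implementation: the learned networks are replaced by plain linear interpolation along the time axis, and the names `resize_time`, `generator_stub`, `decoder_stub`, and `consistency_loss`, as well as the convention that a rate above 1 means faster speech, are assumptions made only for illustration.

```python
import numpy as np


def resize_time(spec: np.ndarray, n_frames: int) -> np.ndarray:
    """Linearly interpolate a (freq_bins, frames) spectrogram to n_frames frames."""
    old_t = np.linspace(0.0, 1.0, spec.shape[1])
    new_t = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(new_t, old_t, row) for row in spec])


def generator_stub(spec: np.ndarray, rate: float) -> np.ndarray:
    """Stand-in for the generator: change the number of frames by 1/rate.

    Assumed convention: rate > 1 speeds speech up (fewer frames),
    rate < 1 slows it down (more frames).
    """
    n_frames = max(2, int(round(spec.shape[1] / rate)))
    return resize_time(spec, n_frames)


def decoder_stub(scaled: np.ndarray, original_frames: int) -> np.ndarray:
    """Stand-in for the decoder: map the time-adjusted signal back to the original length."""
    return resize_time(scaled, original_frames)


def consistency_loss(spec: np.ndarray, rate: float) -> float:
    """Mean absolute error between the input and its scale-then-restore round trip."""
    scaled = generator_stub(spec, rate)
    restored = decoder_stub(scaled, spec.shape[1])
    return float(np.mean(np.abs(spec - restored)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.random((80, 100))  # hypothetical 80-bin, 100-frame spectrogram
    print(generator_stub(spec, 2.0).shape)
    print(consistency_loss(spec, 2.0))
```

In a trained system the two stubs would be neural networks, and a round-trip loss of this kind would be one term alongside the adversarial loss from the discriminator; the sketch only shows how the data flows through the rate change and back.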
Driedger, J., & Müller, M. (2014). TSM Toolbox: MATLAB Implementations of Time-Scale Modification Algorithms. Proceedings of the International Conference on Digital Audio Effects (DAFx), 249–256.
[ESOLA] Rudresh, S., Vasisht, A., Vijayan, K., & Seelamantula, C. S. (2018). Epoch-Synchronous Overlap-Add (ESOLA) for Time- and Pitch-Scale Modification of Speech Signals. [Link]
[WSOLA] Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, 554–557. [Link]
[FESOLA] Roberts, T., & Paliwal, K. K. (2019). Time-Scale Modification Using Fuzzy Epoch-Synchronous Overlap-Add (FESOLA). 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 31–34. [Link]
[HPTSM] Driedger, J., Müller, M., & Ewert, S. (2014). Improving Time-Scale Modification of Music Signals Using Harmonic-Percussive Separation. IEEE Signal Processing Letters, 21(1), 105–109. [Link]
[IPL], [SPL] Laroche, J., & Dolson, M. (1999). Improved phase vocoder time-scale modification of audio. IEEE Transactions on Speech and Audio Processing, 7(3), 323–332. [Link]
[Phavorit_IPL], [Phavorit_SPL] Karrer, T., Lee, E., & Borchers, J. O. (2006). PhaVoRIT: A Phase Vocoder for Real-Time Interactive Time-Stretching. ICMC. [Link]
[PV] Portnoff, M. (1976). Implementation of the digital phase vocoder using the fast Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(3), 243–248. [Link]
[uTVS] Sharma, N., Potadar, S., Chetupalli, S. R., & Sreenivas, T. V. (2017). Mel-scale sub-band modelling for perceptually improved time-scale modification of speech and audio signals. 2017 Twenty-Third National Conference on Communications (NCC), 1–5. [Link]
[FuzzyPV] Damskägg, E.-P., & Välimäki, V. (2017). Audio time stretching using fuzzy classification of spectral bins. Applied Sciences, 7(12), 1293. [Link]
[Elastique] Zplane Development. Elastique time stretching & pitch shifting SDKs, computer program (version 3.2.5). [Link]