An objective, machine-driven measure of song intelligibility would be of great utility for various music information retrieval tasks. Song intelligibility mostly depends on two factors: the amount of interference caused by the background accompaniment, and the quality of the singing vocal. We leverage these two factors to determine the intelligibility of a song. For the first factor, we adapt a well-known method for intelligibility prediction of noisy speech, short-term objective intelligibility (STOI), to singing. The singing-adapted STOI considers the polyphonic song as a time-frequency weighted noisy version of the extracted singing vocal, where we use a U-Net based audio source separation method to extract the singing vocal from the polyphonic song. The singing vocal shares the same underlying physiological production mechanism as speech, with some differences in the pronunciation and prosody of the phonemes. Therefore, for the second factor, we introduce vocal-specific features to measure the intelligibility of the singing vocal, namely excitation source, spectral, and prosodic singing characteristics. We train a regression model to derive the intelligibility scores using a combination of the vocal-specific features and the singing-adapted STOI, obtaining a significant improvement in performance. We perform a detailed analysis of each of these features to establish their efficacy for quantifying song intelligibility.

Sonorant sounds are characterized by regions with a prominent formant structure, high energy, and a high degree of periodicity. In this work, the vocal-tract system, excitation source, and suprasegmental features derived from the speech signal are analyzed to measure the sonority information present in each of them. Vocal-tract system information is extracted from the Hilbert envelope of the numerator of the group delay function, derived from the zero-time windowed speech signal, which provides better resolution of the formants. A five-dimensional feature set is computed from the estimated formants to measure the prominence of the spectral peaks. A feature representing the strength of excitation is derived from the Hilbert envelope of the linear prediction residual, which represents the source information. The correlation of speech over ten consecutive pitch periods is used as the suprasegmental feature representing periodicity information. The combination of evidence from these three different aspects of speech provides better discrimination among the sonorant classes than the baseline MFCC features. The usefulness of the proposed sonority feature is demonstrated in the tasks of phoneme recognition and sonorant classification.
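To make the STOI-style idea concrete, here is a minimal sketch of a correlation-based intelligibility proxy: it frames a reference vocal and a degraded mix, correlates their short-time magnitude spectra, and averages the result. This is an illustration of the general principle only, not the singing-adapted STOI described above; the function name `stoi_like_score` and all parameter values are hypothetical choices for this sketch.

```python
import numpy as np

def stoi_like_score(clean, degraded, frame_len=256, hop=128, eps=1e-12):
    # Toy STOI-style proxy (illustrative, not the singing-adapted STOI):
    # correlate short-time magnitude spectra of the reference vocal and
    # the degraded mix, frame by frame, then average the correlations.
    scores = []
    n = min(len(clean), len(degraded))
    for start in range(0, n - frame_len + 1, hop):
        c = np.abs(np.fft.rfft(clean[start:start + frame_len]))
        d = np.abs(np.fft.rfft(degraded[start:start + frame_len]))
        c = c - c.mean()
        d = d - d.mean()
        denom = np.sqrt((c @ c) * (d @ d)) + eps
        scores.append((c @ d) / denom)
    return float(np.mean(scores))

# A clean tone compared with itself scores near 1; heavy added
# "accompaniment" (here plain noise) lowers the score.
sr = 8000
t = np.arange(sr) / sr
vocal = np.sin(2 * np.pi * 220 * t)
rng = np.random.default_rng(0)
mix = vocal + 2.0 * rng.standard_normal(len(t))
```

In the same spirit as the full measure, the score rewards time-frequency regions where the mix still follows the vocal's spectral envelope, and penalizes regions masked by the accompaniment.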
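The suprasegmental periodicity cue can likewise be sketched in a few lines: correlate successive pitch-period-length segments of the signal and average. This is a simplified stand-in for the ten-pitch-period correlation feature described above, assuming the pitch period is already known; the function name `periodicity_feature` is hypothetical.

```python
import numpy as np

def periodicity_feature(x, pitch_period, n_periods=10):
    # Average normalized correlation between successive pitch-period
    # segments: high for periodic (sonorant-like) regions, near zero
    # for noise-like (obstruent-like) regions.
    corrs = []
    for k in range(n_periods - 1):
        a = x[k * pitch_period:(k + 1) * pitch_period].astype(float)
        b = x[(k + 1) * pitch_period:(k + 2) * pitch_period].astype(float)
        a -= a.mean()
        b -= b.mean()
        denom = np.sqrt((a @ a) * (b @ b))
        corrs.append((a @ b) / denom if denom > 0 else 0.0)
    return float(np.mean(corrs))

# A 100 Hz tone at 8 kHz has an exact 80-sample period, so successive
# periods correlate strongly; white noise does not.
sr = 8000
period = sr // 100
t = np.arange(12 * period) / sr
voiced = np.sin(2 * np.pi * 100 * t)
rng = np.random.default_rng(0)
noise = rng.standard_normal(12 * period)
```

A real system would first estimate the pitch period per frame and evaluate this feature only in voiced regions.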