Vowel detection using a perceptually-enhanced spectrum matching conditioned to phonetic context and speaker identity
Abstract
Vowel detection methods usually adopt a two-stage procedure for detecting vowel landmarks. First, a temporal objective contour (TOC), a time-varying measure of vowel-likeness, is generated from the speech signal. Then, vowel landmarks are extracted by locating the prominent peaks of the TOC. Focusing on the TOC generation stage, this paper presents a new model based on proposed components called matched filters (MFs). The extraction of the MFs and the design of the MF-based model constitute our two main contributions. Motivated by the human auditory system, the MFs are extracted by applying a series of perceptually based processing operations to the speech spectra of voiced frames. Consequently, any factor that causes the speech spectra to vary also changes the extracted MFs, so the filters must be conditioned on the factors that affect their characteristics. Accordingly, the proposed MF-based model is designed in two steps. First, an acoustic space representing two influential factors, namely phonetic context and speaker identity, is modeled. Then, vowel and consonant MFs are conditioned on this context-speaker acoustic space. Thus, instead of using a fixed filter bank for the entire speech signal (a popular TOC generation technique), the proposed TOC is generated by adopting a pair of vowel and consonant MFs for each voiced speech frame. Experiments are conducted separately on two standard continuous speech corpora, a Persian corpus (FARSDAT) and an English one (TIMIT). The experiments show that every characteristic employed in the proposed model reduces the total error measure, to varying degrees. With the proposed method, total error values of 14.2% and 18.9% are obtained in clean conditions for FARSDAT and TIMIT, respectively. Moreover, the effectiveness of the proposed algorithm is verified in additive noise conditions at different signal-to-noise ratios. 
According to the evaluation results, the proposed method compares favorably with existing well-known methods in terms of total error on both corpora, under both clean and noisy conditions. © 2017 Elsevier B.V.
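As a rough illustration of the two-stage procedure described in the abstract, the following Python sketch computes a toy TOC from a per-frame pair of vowel and consonant filters and then picks its peaks as landmarks. The scoring rule (cosine similarity to the vowel filter minus cosine similarity to the consonant filter), the peak-picking thresholds, and all function and parameter names are assumptions made for illustration. The abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def toc_from_matched_filters(spectra, vowel_mf, consonant_mf):
    """Toy TOC: one vowel-likeness score per frame.

    All three arguments are arrays of shape (n_frames, n_bins), so each
    frame gets its own vowel/consonant filter pair. The score used here
    (difference of cosine similarities) is an assumed stand-in.
    """
    def cosine(a, b):
        num = np.sum(a * b, axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
        return num / den

    return cosine(spectra, vowel_mf) - cosine(spectra, consonant_mf)


def pick_landmarks(toc, min_height=0.0, min_distance=3):
    """Return sorted indices of local maxima of the TOC.

    A frame is a candidate if it exceeds its left neighbour, is not below
    its right neighbour, and lies above min_height. Candidates are then
    accepted from strongest to weakest, skipping any that fall within
    min_distance frames of an already accepted peak.
    """
    candidates = [
        i for i in range(1, len(toc) - 1)
        if toc[i] > toc[i - 1] and toc[i] >= toc[i + 1] and toc[i] > min_height
    ]
    kept = []
    for i in sorted(candidates, key=lambda k: toc[k], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)
```

In this sketch the filter pair may differ from frame to frame, which mirrors the idea of selecting filters according to phonetic context and speaker identity rather than applying one fixed filter bank. How the filters are actually selected and scored is left open here.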