Articles
IEEE Transactions on Geoscience and Remote Sensing (15580644), 63
In this article, a multimodal deep architecture for the classification of light detection and ranging (LiDAR) data and hyperspectral images (HSI) is proposed, which acquires knowledge of both modalities by leveraging modality-specific and complementary information. The proposed model consists of two main steps. First, to improve the performance of a 2-D convolutional neural network (2DCNN), low-frequency maximum-autocorrelation-factor features of the HSI are injected into the 2DCNN; these are called the multiscale features of the 2DCNN. Second, to improve the accuracy of the 2DCNN and extract smooth, semantic information, the posterior energy of a hidden Markov random field (HMRF) is modified using Gaussian attention and albedo-recovery attention mechanisms together with the LiDAR and HMRF energies. These features are then fused through another attention mechanism, called attention-based HMRF, and this HMRF model also performs the fusion of HSI and LiDAR. The proposed model is tested on the Houston 2013, Trento, and MUUFL datasets and compared with several state-of-the-art methods. The resulting classification accuracies and the ablation study show the superior performance of the proposed method. © 1980-2012 IEEE.
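To make the fusion idea concrete, below is a minimal PyTorch sketch of attention-weighted fusion of two modality feature maps, in the spirit of the attention-based fusion described above. The module name and the gating design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention-weighted fusion of two modality feature
# maps (HSI and LiDAR); module and parameter names are illustrative, not
# the paper's implementation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learn per-channel, per-pixel attention weights to blend two modality features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, hsi_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # Attention map in [0, 1] decides the contribution of each modality.
        alpha = self.gate(torch.cat([hsi_feat, lidar_feat], dim=1))
        return alpha * hsi_feat + (1.0 - alpha) * lidar_feat

fused = AttentionFusion(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```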
Text summarization is a valuable method for extracting important details from large volumes of text data, facilitating tasks such as text data analysis. Various text summarization techniques have been developed over time, some focusing on selecting and summarizing short sentences while others overlook the semantic relationships between sentences. Extractive document summarization involves learning cross-sentence relations, a critical aspect that has been extensively explored using various approaches. One effective method is to employ graph-based neural networks, which offer an intricate structure capable of capturing relations among sentences. In this paper, we present a contextualized heterogeneous graph neural network for extractive text summarization (ConHGNN-SUM), which incorporates semantic nodes that extend beyond individual sentences and emphasizes capturing the relationships between selected sentences as a final step in the summarization process. These extra nodes function as intermediaries that connect sentences and enhance the interrelationships between them. Our model improves on conventional graph-based extractive methods and delivers performance comparable to other advanced extractive summarization systems. © 2024 IEEE.
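As a rough illustration of how extra semantic nodes can mediate between sentences, the following toy sketch performs one round of message passing over a sentence-word bipartite graph. The setup (mean aggregation, random initial embeddings) is an assumption for demonstration, not the ConHGNN-SUM architecture.

```python
# A minimal sketch of the sentence-word heterogeneous graph idea: extra
# word ("semantic") nodes mediate between sentence nodes. One round of
# mean-aggregation message passing; all names here are illustrative.
import torch

sentences = [["deep", "learning", "works"],
             ["graphs", "model", "relations"],
             ["deep", "graphs", "help"]]
vocab = sorted({w for s in sentences for w in s})
w2i = {w: i for i, w in enumerate(vocab)}

# Bipartite incidence matrix: A[i, j] = 1 if word j occurs in sentence i.
A = torch.zeros(len(sentences), len(vocab))
for i, s in enumerate(sentences):
    for w in s:
        A[i, w2i[w]] = 1.0

sent_h = torch.randn(len(sentences), 8)   # initial sentence embeddings
word_h = torch.randn(len(vocab), 8)       # initial word embeddings

# sentence -> word, then word -> sentence: sentences that share words
# now exchange information through the intermediate word nodes.
word_h = (A.t() @ sent_h) / A.sum(0, keepdim=True).t().clamp(min=1)
sent_h = (A @ word_h) / A.sum(1, keepdim=True).clamp(min=1)
print(sent_h.shape)  # torch.Size([3, 8])
```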
Engineering Applications of Artificial Intelligence (09521976), 138
Text-independent speaker verification is a challenging research field in biometric user authentication and a major application of artificial intelligence. It determines whether two content-unconstrained utterances come from the same or different speakers. Deep neural architectures typically extract temporal-frequency feature maps with a local receptive field (RF). Advanced architectures have also extracted features with a global RF, but only over the temporal dimension. To effectively leverage feature maps with a global RF over the frequency dimension, this work proposes a Deep Attentive Adaptive Filter (DAAF) module. It first applies a Fourier transform to the frequency dimension to yield features in a new spectral dimension; updating each spectral value in the Fourier domain affects all frequency values in the original domain. The feature maps are further boosted by introducing attention-based adaptive filtering in the Fourier domain, which adaptively modulates the spectral components with respect to the temporal and channel dimensions. By adopting residual convolutional networks, the DAAF module is then applied in parallel with a residual block to capture features with both global and local frequency RFs. The multi-RF features are finally combined through a novel attentive feature fusion module. Comprehensive experiments are conducted on two benchmark corpora under in-domain and out-of-domain scenarios. The superior performance of the proposed networks is verified with respect to two criteria, the equal error rate and the minimum of the detection cost function, while using a small number of learnable parameters. © 2024 Elsevier Ltd
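The following hedged PyTorch sketch illustrates the core spectral-filtering step as described: an FFT along the frequency axis, modulation of the spectral components by learnable complex weights, and an inverse FFT. The attention mechanism that conditions the filter on the temporal and channel dimensions is omitted here; the module name and shapes are illustrative assumptions.

```python
# Hedged sketch of the core DAAF idea: FFT along the frequency axis,
# learnable modulation of spectral components, inverse FFT. An illustrative
# reconstruction, not the authors' code; the attention conditioning is omitted.
import torch
import torch.nn as nn

class SpectralModulation(nn.Module):
    def __init__(self, channels: int, freq_bins: int):
        super().__init__()
        # One complex weight per (channel, spectral bin); rfft halves the axis.
        n_spec = freq_bins // 2 + 1
        self.weight = nn.Parameter(torch.ones(channels, n_spec, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        spec = torch.fft.rfft(x, dim=2)                       # to spectral domain
        spec = spec * self.weight.unsqueeze(0).unsqueeze(-1)  # global-RF filtering
        return torch.fft.irfft(spec, n=x.size(2), dim=2)      # back to frequency axis

y = SpectralModulation(channels=32, freq_bins=80)(torch.randn(4, 32, 80, 200))
print(y.shape)  # torch.Size([4, 32, 80, 200])
```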
In today's digital age, the comprehension and prediction of human personality traits have assumed paramount significance. This study addresses the task of predicting the Big Five personality traits from textual data, harnessing the capabilities of advanced natural language processing models. The focal dataset is ChaLearn First Impressions V2, a rich collection of human-generated text coupled with Big Five personality trait labels. A diverse array of models is examined, ranging from deep learning models such as the Deep Pyramid Convolutional Neural Network (DPCNN) and the Hierarchical Attention Network (HAN) to transformer-based architectures such as BERT and FLAN-T5. These models undergo evaluation across several training scenarios: fine-tuning all layers, freezing only the embedding layer, and freezing all layers (the last applied exclusively to Transformer models). Notably, models such as DPCNN and HAN achieve remarkable accuracy attributable to their hierarchical feature extraction. Conversely, Transformer models like ELECTRA excel when layers remain frozen, showcasing their exceptional contextual comprehension. Furthermore, the study employs word clouds to visually encapsulate the essence of each Big Five personality trait, revealing relationships between specific words and these traits. The findings underscore the interplay among model architecture, training methodology, and layer freezing, offering insights into strategies that yield optimal performance in predicting personality traits. In an age dominated by digital communication, this research contributes significantly to our understanding and prediction of human personalities. © 2024 IEEE.
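As an example of one of the training scenarios above, the snippet below freezes only the embedding layer of a Hugging Face BERT model. This is a standard pattern, not the study's exact script; the regression setup with five outputs for the Big Five scores is an assumption.

```python
# Sketch of the "freeze only the embedding layer" fine-tuning scenario with
# a Hugging Face BERT; a standard pattern, not the study's exact script.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=5,                 # assumed: one regression output per Big Five trait
    problem_type="regression",
)

# Keep embeddings fixed, fine-tune everything else.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```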
Knowledge-Based Systems (09507051), 301
Self-supervised learning (SSL) aims to create semantically enriched representations from unannotated data. A prevalent strategy in this field involves training a unified representation space that is invariant to various transformation combinations. However, creating a single representation invariant to multiple transformations poses several challenges: the efficacy of such a representation space depends on factors such as the intensity, sequence, and combination scenarios of the transformations. As a result, features generated in a single representation space may exhibit limited adaptability for subsequent tasks. In contrast to the conventional SSL training approach, we introduce a novel method that constructs multiple atomic transformation-invariant representation subspaces. Each subspace in the proposed method is invariant to a specific atomic transformation from a predefined reference set. Our method offers increased flexibility by enabling the downstream task to weigh every atomic transformation-invariant subspace according to the desired feature space. A series of experiments was conducted to compare our approach to traditional self-supervised learning methods and assess its effectiveness. This evaluation encompassed diverse data regimes, datasets, evaluation protocols, and perspectives on source-destination data distribution. Our results highlight the superiority of our method over training strategies based on a single transformation-invariant representation space. Additionally, the proposed method demonstrated superior performance in reducing false positives in pulmonary nodule detection compared to several recent supervised and self-supervised approaches. © 2024 Elsevier B.V.
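A minimal sketch of one training step for the multi-subspace idea is given below: a shared backbone feeds one projection head per atomic transformation, and each head is trained to agree between an image and its transformed version. The transformations, losses, and names are assumptions, and collapse-prevention terms used in practice are omitted for brevity.

```python
# Illustrative sketch: one projection head ("subspace") per atomic
# transformation, each trained for invariance only to its own transformation.
# Names and losses are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
transforms = {
    "flip":  lambda x: torch.flip(x, dims=[-1]),
    "noise": lambda x: x + 0.1 * torch.randn_like(x),
}
heads = nn.ModuleDict({name: nn.Linear(128, 64) for name in transforms})

x = torch.randn(16, 3, 32, 32)
loss = 0.0
for name, t in transforms.items():
    z1 = F.normalize(heads[name](backbone(x)), dim=1)
    z2 = F.normalize(heads[name](backbone(t(x))), dim=1)
    # Invariance term: an image and its transform should agree inside the
    # subspace dedicated to that atomic transformation.
    loss = loss + (1 - (z1 * z2).sum(dim=1)).mean()
print(float(loss))
```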
Expert Systems with Applications (09574174), 222
As an attractive research problem in biometric authentication, Text-Independent Speaker Verification (TI-SV) aims to determine whether two given unconstrained utterances come from the same speaker. As state-of-the-art solutions, end-to-end approaches using deep neural networks seek to learn a highly discriminative speaker embedding space. In this paper, we propose a novel end-to-end approach for speaker embedding learning that focuses on two crucial factors: the speaker embedder architecture and the objective function. The proposed module in the speaker embedder is composed of an Efficient Multi-resolution feature Representation (EMR) block followed by a Multi-scale Channel Attention Fusion (MCAF) block. The EMR effectively addresses the issue of fixed-resolution convolutional kernels, which are commonly used in most embedder architectures, while the MCAF significantly improves on the simple summation-based feature fusion used in residual embedder networks. Regarding the objective function, we guide the speaker embedding space to learn embedding-to-embedding relations, in addition to the embedding-to-training-class relations employed by most previous methods. To this end, we propose employing a dynamic graph attention network on top of the proposed embedder to learn all informative relations between embeddings, training both the embedder and the graph-based network in an end-to-end manner. We conduct various experiments on the large-scale benchmark dataset VoxCeleb1&2. The effectiveness of all proposed components is verified through an ablation study. We show the superior or competitive performance of the proposed approach compared to seven well-known embedding architectures and 32 SV systems, in terms of two evaluation metrics, EER and minDCF, as well as the number of embedder parameters. © 2023 Elsevier Ltd
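To illustrate the attention-based alternative to summation fusion in a residual block, here is a hedged sketch of a per-channel gated fusion; the actual MCAF block is multi-scale and more elaborate, so the module below should be read as an assumption-laden toy.

```python
# Rough sketch in the spirit of MCAF: instead of summing a residual branch
# and the identity, a learned per-channel gate mixes them. Purely
# illustrative; the paper's block is multi-scale and more elaborate.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, identity, residual):
        alpha = self.gate(identity + residual)   # per-channel weights in [0, 1]
        return alpha * identity + (1 - alpha) * residual

out = ChannelAttentionFusion(64)(torch.randn(2, 64, 20, 50), torch.randn(2, 64, 20, 50))
print(out.shape)  # torch.Size([2, 64, 20, 50])
```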
Lately, deep learning has become increasingly popular for solving problems across multiple domains, including medical image analysis. This research introduces a process based on deep convolutional neural networks to diagnose Alzheimer's disease (AD) and its stages from magnetic resonance imaging (MRI) scans. Identifying AD in elderly individuals is difficult because the disease damages brain cells related to memory and cognition, and its patterns are hard to distinguish from normal brain patterns in scans; detection therefore requires discriminative feature representations, which deep learning methods can acquire from the MRI data. In this paper, five different transfer learning models are trained as 15 binary classifiers, each distinguishing two of the AD, Mild Cognitive Impairment (MCI), and Cognitively Normal (CN) classes. This method finds the best transfer learning model for each binary comparison. The proposed technique achieves a best accuracy of 92% for the AD vs. CN classifier, 94% for the AD vs. MCI classifier, and 72% for the MCI vs. CN classifier, which shows the effectiveness of transfer learning in distinguishing the AD vs. CN and AD vs. MCI cases. © 2023 IEEE.
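A minimal sketch of the pairwise scheme might look as follows: one pretrained backbone per binary comparison, each with a fresh two-way head. The choice of ResNet-18 as the backbone is an illustrative assumption; the paper evaluates five different transfer learning models.

```python
# Minimal sketch of the pairwise scheme: one pretrained backbone per binary
# comparison (AD vs CN, AD vs MCI, MCI vs CN), each with a fresh 2-way head.
# The backbone choice and names are illustrative assumptions.
import torch.nn as nn
from torchvision import models

pairs = [("AD", "CN"), ("AD", "MCI"), ("MCI", "CN")]
classifiers = {}
for a, b in pairs:
    net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    net.fc = nn.Linear(net.fc.in_features, 2)  # binary head for this pair
    classifiers[f"{a}_vs_{b}"] = net

print(list(classifiers))  # ['AD_vs_CN', 'AD_vs_MCI', 'MCI_vs_CN']
```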
This paper presents the system developed by the Sartipi-Sedighin team for SemEval 2023 Task 2, a shared task focused on multilingual complex named entity recognition (NER), or MultiCoNER II. The goal of this task is to identify and classify complex named entities (NEs) in text across multiple languages. To tackle the MultiCoNER II task, we leveraged pre-trained language models (PLMs) fine-tuned for each language included in the dataset. In addition, we applied a data augmentation technique to increase the amount of training data available to our models. Specifically, we searched Wikipedia for relevant NEs that already existed in the training data and added new instances of these entities to our training corpus. Our team achieved an overall F1 score of 61.25% in the English track and 71.79% in the multilingual track, among the 13 tracks of the shared task to which we submitted. © 2023 Association for Computational Linguistics.
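The described augmentation can be sketched as substituting an entity mention with another entity of the same type drawn from a Wikipedia-derived gazetteer. The toy gazetteer and function below are hypothetical stand-ins, not the team's pipeline.

```python
# Hedged sketch of the augmentation idea: swap a named entity for another
# entity of the same type from a Wikipedia-derived gazetteer. The gazetteer
# here is a toy stand-in; names are illustrative.
import random

gazetteer = {"LOC": ["Paris", "Tehran", "Oslo"], "PER": ["Ada Lovelace", "Alan Turing"]}

def augment(tokens, spans):
    """spans: list of (start, end, type) entity offsets over tokens."""
    tokens = list(tokens)
    for start, end, etype in reversed(spans):  # right-to-left keeps offsets valid
        replacement = random.choice(gazetteer[etype]).split()
        tokens[start:end] = replacement
    return tokens

print(augment(["She", "visited", "Berlin", "."], [(2, 3, "LOC")]))
```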
World Wide Web (15731413), 26(5), pp. 3027-3054
The standard Euclidean distance assigns equal contributions to all features of each pair of data samples when computing the similarity matrix, whereas different features of real-world datasets have different importance. This paper proposes a new clustering method based on reinforcement learning and soft feature selection with three innovative ideas. First, a novel distance metric based on the importance of features is introduced, which can additionally suppress irrelevant features almost entirely. Second, a new soft weighting mechanism is defined based on this distance to determine the effect of the neighborhood probability in the similarity matrix. Since the training data contain noisy and redundant features, a sparsity regularization term is applied to address this problem and emphasize feature selection. Third, after these dimensionality reduction steps, a new clustering method is developed based on reinforcement learning, which treats the obtained low-dimensional data points as the states of the learning agents. It applies different actions until convergence, transferring the most scattered points from one cluster to another to produce coherent clusters and balance them. The proposed method is able to achieve high within-cluster consistency. The experimental results on several real-world datasets show the good performance and efficiency of the proposed method. Statistical analysis, parameter sensitivity analysis, and time complexity analysis all confirm the soundness of the results obtained. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
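A small sketch of the first idea, a feature-weighted distance with a sparsity penalty that drives the weights of irrelevant features toward zero, is given below; the weight values and penalty coefficient are illustrative assumptions, not the paper's learned solution.

```python
# Sketch of the feature-weighted distance idea: a nonnegative weight per
# feature scales its contribution, and an L1 term pushes weights of
# irrelevant features toward zero. Illustrative, not the paper's solver.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = np.array([1.0, 0.8, 0.0, 0.0, 0.3])  # sparse weights: features 3-4 dropped

def weighted_dist(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

l1_penalty = 0.1 * np.sum(np.abs(w))  # sparsity regularization term
print(weighted_dist(X[0], X[1], w), l1_penalty)
```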
As a state-of-the-art solution for speaker verification problems, deep neural networks have been successfully employed to extract speaker embeddings that represent speaker-informative features. Objective functions, as the supervisors for learning discriminative embeddings, play a crucial role in this task. In this paper, motivated by the success of metric learning approaches, we investigate four newly proposed metrics from the literature, specifically for the speaker verification problem. For deeper comparison, we consider metrics from both main groups of metric-based objectives, i.e., instance-based and proxy-based ones. Considering embeddings as instances, the first group exploits instance-to-instance relations, while the latter associates instances with proxies that represent the training samples. Evaluations in terms of Equal Error Rate (EER) are conducted in two conventional manners, end-to-end and modular, where cosine similarity and PLDA are applied to the embeddings, respectively. Experimental results show that in the end-to-end case, instance-based metrics outperform proxy-based ones, while interestingly the opposite behavior is observed in the modular case. Finally, the lowest EER is achieved by adopting one of the proxy-based metrics, namely SoftTriple, in the modular manner. It yields relative improvements of up to 12% compared to the state-of-the-art method, i.e., the x-vector. © 2020 IEEE.
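For reference, the modular scoring setup can be sketched as cosine similarity between embeddings plus an EER computed from target and non-target scores, as below. This is standard practice rather than the paper's exact pipeline, and the score distributions are synthetic.

```python
# Small sketch of speaker-verification scoring: cosine similarity between
# two embeddings, and EER computed from target/non-target scores.
# Standard practice, not the paper's exact pipeline.
import numpy as np

def cosine_score(e1, e2):
    return np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

def eer(target_scores, nontarget_scores):
    # Sweep thresholds; EER is where false-accept and false-reject rates meet.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(1)
print(eer(rng.normal(1.0, 0.5, 1000), rng.normal(0.0, 0.5, 1000)))
```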
The quality and intelligibility of speech signals are degraded under additive background noise, which is a critical problem for hearing aid and cochlear implant users. Motivated to address this problem, we propose a novel speech enhancement approach using a deep spectrum image translation network. To this end, we suggest a new architecture, called VGG19-UNet, where a deep fully convolutional network known as VGG19 is embedded in the encoder part of an image-to-image translation network, i.e., U-Net. Moreover, we propose a perceptually modified version of the spectrum image represented in the Mel frequency and power-law non-linearity amplitude domains, which are good approximations of the human auditory perception model. By conducting experiments on a real challenge in speech enhancement, i.e., unseen noise environments, we show that the proposed approach outperforms other enhancement methods in terms of both quality and intelligibility measures, represented by PESQ and ESTOI, respectively. © 2019 IEEE.
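A brief sketch of the perceptually modified spectrum described above: a Mel-scaled spectrogram followed by power-law amplitude compression. The exponent 0.3 is a common perceptual choice assumed here; the paper's exact value may differ.

```python
# Sketch of the perceptually-modified spectrum image: a Mel-scaled
# spectrogram with power-law amplitude compression (exponent 0.3 assumed).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))           # example clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
perceptual = np.power(mel, 0.3)                       # power-law non-linearity
print(perceptual.shape)                               # (64, frames)
```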
Computer Speech and Language (08852308), 50, pp. 105-125
From the perspectives of both speech production and speech perception, vowels, as syllable nuclei, can be considered the most significant speech events. Detection of vowel events from a speech signal is usually performed by a two-step procedure. First, a temporal objective contour (TOC), a time-varying measure of vowel similarity, is generated from the speech signal. Second, vowel landmarks, the places of vowel events, are extracted by locating prominent peaks of the TOC. In this paper, by employing several spectral models in a sequential manner, we propose a new framework that directly addresses three possible errors in the vowel detection problem, namely vowel deletion, consonant insertion, and vowel insertion. The proposed framework consists of three main steps. In the first step, two solutions are proposed to substantially reduce the initial vowel deletion error. The first solution is to use the peaks detected by a conventional energy-based TOC, but without the TOC smoothing and peak thresholding processes. The peaks detected by a spectral-based TOC generated from GMM models are put forward as the second solution for achieving a smaller vowel deletion error. In the second step, a two-class support vector machine (SVM) classifier is adopted to distinguish consonant peaks from vowel peaks; removing the peaks classified as consonants reduces the consonant insertion error. Finally, a two-class SVM classifier is proposed to identify consecutive peaks detected within the same vowel; merging the peaks classified as "same vowel" considerably reduces the vowel insertion error. Experiments are conducted separately on three standard speech corpora, namely FARSDAT, TIMIT, and TFARSDAT. The effectiveness of the techniques proposed to reduce the three types of detection errors is verified. The total error (the sum of the three detection errors) and F-measure are about 9.7% and 95.1% for FARSDAT, 17.5% and 91.3% for TIMIT, and 19.6% and 90.2% for the TFARSDAT corpus, respectively. The evaluation results show that the proposed framework outperforms existing well-known methods in terms of both total error and F-measure on both read and spontaneous speech corpora. © 2018 Elsevier Ltd
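The first solution, an energy-based TOC with peak picking and no thresholding, can be sketched as follows; the frame sizes, the synthetic signal, and the peak-distance constraint are assumptions for demonstration.

```python
# Illustrative sketch of the two-step procedure: a short-time energy contour
# as the TOC, and scipy peak picking (no thresholding, so candidate peaks
# are kept) as the landmark locator. Parameters are assumptions.
import numpy as np
from scipy.signal import find_peaks

def energy_toc(signal, frame_len=400, hop=160):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

rng = np.random.default_rng(2)
x = rng.normal(size=16000) * np.sin(np.linspace(0, 20 * np.pi, 16000)) ** 2
toc = energy_toc(x)
peaks, _ = find_peaks(toc, distance=5)  # candidate vowel landmarks
print(len(peaks))
```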
Speech Communication (01676393), 91, pp. 28-48
Vowel detection methods usually adopt a two-stage procedure for detecting vowel landmarks. First, a temporal objective contour (TOC), a time-varying measure of vowel-likeness, is generated from the speech signal. Then, vowel landmarks are extracted by locating outstanding peaks of the TOC. Focusing on the TOC generation stage, this paper presents a new model based on proposed components called matched filters (MFs). Extraction of the MFs and design of the MF-based model constitute our two main contributions. Motivated by the human auditory system, the MFs are extracted by applying a series of perceptually-based processing operations to the speech spectra of the voiced frames. Accordingly, any factor leading to variation of the speech spectra will also change the extracted MFs, so it is necessary to condition the filters on the factors affecting their characteristics. Based on this fact, the proposed MF-based model is designed in the following two steps. First, an acoustic space representing two effective factors, namely phonetic context and speaker identity, is modeled. Then, vowel and consonant MFs are conditioned on this context-speaker acoustic space. Indeed, instead of using a fixed filter bank for the entire speech signal (a popular TOC generation technique), the proposed TOC is generated by adopting a pair of vowel and consonant MFs for each voiced speech frame. Experiments are conducted separately on two standard continuous speech corpora, a Persian corpus (FARSDAT) and an English one (TIMIT). Across various experiments, we find that all the characteristics employed in the proposed model decrease the total error measure to varying degrees. Using the proposed method, total error values of 14.2% and 18.9% are obtained in clean conditions for FARSDAT and TIMIT, respectively. Moreover, the effectiveness of the proposed algorithm is verified in additive noise conditions with different signal-to-noise ratios. According to the evaluation results, the proposed method shows desirable performance in terms of total error in comparison with existing well-known methods on both corpora, in both clean and noisy conditions. © 2017 Elsevier B.V.
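As a toy illustration of the matched-filter idea, the sketch below scores each frame spectrum against a vowel template and a consonant template and takes the difference as the TOC value. The random templates stand in for the paper's MFs, which are conditioned on a context-speaker acoustic space.

```python
# Toy sketch of the matched-filter TOC: score each frame's spectrum against
# a vowel template and a consonant template; the difference is the TOC value.
# Random templates are stand-ins for the paper's conditioned MFs.
import numpy as np

rng = np.random.default_rng(3)
frames = rng.random((100, 40))            # 100 frames x 40 spectral bins
vowel_mf, cons_mf = rng.random(40), rng.random(40)

def score(frame, mf):
    return np.dot(frame, mf) / (np.linalg.norm(frame) * np.linalg.norm(mf))

toc = np.array([score(f, vowel_mf) - score(f, cons_mf) for f in frames])
print(toc.shape)  # (100,)
```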
International Journal of Pattern Recognition and Artificial Intelligence (02180014), 25(1), pp. 1-35
One problem in background estimation is inherent change in the background, such as waving tree branches, water surfaces, camera shake, and the presence of moving objects in every image. In this paper, a new method for background estimation is proposed based on function approximation in the kernel domain. For this purpose, a Weighted Kernel-based Learning Algorithm (WKLA) is designed. The WKLA is a weighted variant of the kernel least mean square algorithm capable of function approximation in the presence of noise. The proposed background estimation method thus comprises two stages: first, a novel outlier detection algorithm, the Fuzzy Outlier Detector (FOD), is applied; the obtained results are then fed to the WKLA. The proposed approach can handle scenes containing moving backgrounds, gradual illumination changes, camera vibrations, and non-empty backgrounds. Qualitative results and quantitative evaluations on various indoor and outdoor sequences, relative to existing approaches, show the high accuracy and effectiveness of the proposed method in background estimation and foreground detection. © 2011 World Scientific Publishing Company.
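The core recursion of a kernel least mean square (KLMS) learner, the basis of the WKLA, can be sketched as below; the weighting scheme and the fuzzy outlier detector are omitted, so this is only a minimal, assumption-laden illustration.

```python
# Minimal kernel least mean square (KLMS) sketch for function approximation;
# the paper's weighted variant (WKLA) and the fuzzy outlier detector are
# omitted, so this shows only the core recursion.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def klms(inputs, targets, lr=0.5):
    centers, coeffs = [], []
    for x, d in zip(inputs, targets):
        y = sum(a * gaussian_kernel(c, x) for c, a in zip(centers, coeffs))
        centers.append(x)            # grow the dictionary with each sample
        coeffs.append(lr * (d - y))  # LMS update in the kernel feature space
    return centers, coeffs

t = np.linspace(0, 1, 50).reshape(-1, 1)
centers, coeffs = klms(t, np.sin(2 * np.pi * t).ravel())
print(len(centers))  # 50
```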