Sequential use of spectral models to reduce deletion and insertion errors in vowel detection
Abstract
From the perspectives of both speech production and speech perception, vowels, as syllable nuclei, can be considered the most significant speech events. Detection of vowel events in a speech signal is usually performed by a two-step procedure. First, a temporal objective contour (TOC), a time-varying measure of vowel similarity, is generated from the speech signal. Second, vowel landmarks, marking the locations of vowel events, are extracted by locating prominent peaks of the TOC. In this paper, by employing spectral models in a sequential manner, we propose a new framework that directly addresses the three possible errors in the vowel detection problem, namely vowel deletion, consonant insertion, and vowel insertion. The proposed framework consists of three main steps. In the first step, two solutions are proposed to substantially reduce the initial vowel deletion error. The first solution is to use the peaks detected by a conventional energy-based TOC, but without the TOC smoothing and peak thresholding processes. The peaks detected by a spectral-based TOC generated from Gaussian mixture models (GMMs) are put forward as the second solution for achieving a smaller vowel deletion error. In the second step, a two-class support vector machine (SVM) classifier is adopted to distinguish consonant peaks from vowel peaks. Removing the peaks classified as consonants reduces the consonant insertion error. Finally, a two-class SVM classifier is proposed to identify consecutive peaks detected within the same vowel. Merging the peaks classified as “same vowel” considerably reduces the vowel insertion error. Experiments are conducted separately on three standard speech corpora, namely FARSDAT, TIMIT, and TFARSDAT, and the effectiveness of the techniques proposed to reduce the three types of detection errors is verified.
The total error (the sum of the three detection errors) and the F-measure are about 9.7% and 95.1% for FARSDAT, 17.5% and 91.3% for TIMIT, and 19.6% and 90.2% for TFARSDAT, respectively. The evaluation results show that the proposed framework outperforms existing well-known methods in terms of both total error and F-measure on both read and spontaneous speech corpora. © 2018 Elsevier Ltd
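As an illustrative sketch only (not the authors' implementation), the first step of the framework can be pictured as follows: a short-time energy contour serves as the TOC, and every local maximum is kept as a vowel candidate, deliberately skipping the smoothing and peak-thresholding stages that the paper identifies as sources of vowel deletion. Frame length, hop size, and the toy signal are assumptions for demonstration.

```python
import numpy as np
from scipy.signal import find_peaks

def energy_toc(signal, frame_len=400, hop=160):
    """Short-time energy contour used as the TOC.
    frame_len/hop are illustrative (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def candidate_vowel_peaks(signal):
    """All local maxima of the raw TOC, kept as vowel candidates.
    No height/prominence threshold is applied, so spurious
    (consonant or duplicate) peaks remain; in the paper these would
    be removed later by the two SVM stages."""
    toc = energy_toc(signal)
    peaks, _ = find_peaks(toc)  # every local maximum survives
    return toc, peaks

# Toy example: two "vowel-like" energy bursts embedded in low-level noise.
rng = np.random.default_rng(0)
sig = 0.01 * rng.standard_normal(16000)
sig[2000:5000] += np.sin(2 * np.pi * 200 * np.arange(3000) / 16000)
sig[9000:12000] += np.sin(2 * np.pi * 300 * np.arange(3000) / 16000)
toc, peaks = candidate_vowel_peaks(sig)
```

Because no threshold is applied, the candidate set contains many noise peaks in addition to the two burst peaks; this is intentional, since missed vowels (deletions) cannot be recovered downstream, whereas extra peaks can still be rejected or merged by the subsequent SVM classifiers.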