On Metric-based Deep Embedding Learning for Text-Independent Speaker Verification
Abstract
As a state-of-the-art solution for speaker verification problems, deep neural networks have been successfully employed to extract speaker embeddings that capture speaker-informative features. Objective functions, which supervise the learning of discriminative embeddings, play a crucial role in this process. In this paper, motivated by the success of metric learning approaches, we investigate four recently proposed metrics from the literature, specifically for the speaker verification problem. For a deeper comparison, we draw these metrics from both main groups of metric-based objectives, i.e. instance-based and proxy-based ones. Treating embeddings as instances, the former group exploits instance-to-instance relations, while the latter associates instances with proxies that act as representatives of the training samples. Evaluations in terms of Equal Error Rate (EER) are conducted in two conventional setups: end-to-end and modular, where cosine similarity and PLDA scoring are applied to the embeddings, respectively. Experimental results show that in the end-to-end case, instance-based metrics outperform proxy-based ones, while interestingly the opposite behavior is observed in the modular case. Finally, the lowest EER is achieved by adopting one of the proxy-based metrics, namely SoftTriple, in the modular setup. It yields relative improvements of up to 12% over the state-of-the-art method, i.e. the x-vector. © 2020 IEEE.
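To make the evaluation protocol concrete, the following is a minimal sketch (not taken from the paper) of how trial scores can be produced by cosine similarity between speaker embeddings and how the Equal Error Rate is then read off as the operating point where the false-acceptance and false-rejection rates coincide. All function names here are illustrative, and the threshold sweep is a simple implementation, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    emb1, emb2 = np.asarray(emb1, float), np.asarray(emb2, float)
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def compute_eer(genuine_scores, impostor_scores):
    """Equal Error Rate: sweep a decision threshold over all observed scores
    and return the error rate where false-acceptance ~= false-rejection."""
    genuine = np.asarray(genuine_scores, float)
    impostor = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # FRR: genuine trials rejected (score below threshold)
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # FAR: impostor trials accepted (score at or above threshold)
    far = np.array([(impostor >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0

# Toy usage: perfectly separated scores give EER = 0
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # -> 0.0
```

In the modular setup described above, `cosine_score` would be replaced by a PLDA log-likelihood-ratio scorer trained on the embeddings, while the EER computation is unchanged.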