Deep attentive adaptive filter module in residual blocks for text-independent speaker verification
Abstract
Text-independent speaker verification, a major application of artificial intelligence in biometric user authentication, is a challenging research field. It determines whether two content-unconstrained utterances come from the same speaker or from different speakers. Deep neural architectures typically extract temporal-frequency feature maps with a local receptive field (RF). Advanced architectures have also extracted features with a global RF, but only over the temporal dimension. To effectively leverage feature maps with a global RF over the frequency dimension, this work proposes a Deep Attentive Adaptive Filter (DAAF) module. It first applies a Fourier transform along the frequency dimension to yield features in a new spectral dimension; because each spectral value depends on every frequency bin, updating a single value in the Fourier domain affects all frequency values in the original domain. The feature maps are further enhanced by attention-based adaptive filtering in the Fourier domain, which modulates the spectral components conditioned on the temporal and channel dimensions. Within residual convolutional networks, the DAAF module is applied in parallel with a residual block to capture features with both global and local frequency RFs. The multi-RF features are finally combined through a novel attentive feature fusion module. Comprehensive experiments are conducted on two benchmark corpora under both in-domain and out-of-domain scenarios. The proposed networks achieve superior performance with respect to two criteria, the equal error rate and the minimum of the detection cost function, while using a small number of learnable parameters.
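To make the described pipeline concrete, the following is a minimal PyTorch sketch of the DAAF idea, assuming feature maps of shape (batch, channels, frequency, time). The specific attention layers, the real-valued filter parameterization, and the fusion weighting are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn


class DAAFBlock(nn.Module):
    """Sketch: global-RF spectral filtering in parallel with a local-RF
    residual convolution, combined by an attentive fusion (assumed design)."""

    def __init__(self, channels: int, freq_bins: int):
        super().__init__()
        spec_bins = freq_bins // 2 + 1  # rFFT output length along frequency
        # Attention that predicts a filter value per (channel, spectral bin, time),
        # from frequency-pooled context, so the filter adapts to the temporal
        # and channel dimensions as the abstract describes.
        self.filter_attn = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels * spec_bins, kernel_size=1),
            nn.Sigmoid(),
        )
        # Parallel local-RF branch: an ordinary residual 2D convolution stack.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Attentive fusion: per-channel softmax weights over the two branches.
        self.fuse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )
        self.spec_bins = spec_bins

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        # Global-RF branch: FFT along the frequency dimension.
        spec = torch.fft.rfft(x, dim=2)              # (b, c, f//2+1, t), complex
        ctx = x.mean(dim=2)                          # (b, c, t) frequency-pooled context
        filt = self.filter_attn(ctx).view(b, c, self.spec_bins, t)
        spec = spec * filt                           # adaptive spectral modulation
        glob = torch.fft.irfft(spec, n=f, dim=2)     # back to the frequency domain
        # Local-RF branch.
        loc = self.local(x)
        # Attentive feature fusion of the two branches.
        w = self.fuse(glob + loc).view(b, 2, c, 1, 1).softmax(dim=1)
        fused = w[:, 0] * glob + w[:, 1] * loc
        return torch.relu(x + fused)                 # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 32, 80, 200)   # (batch, channels, mel bins, frames)
    out = DAAFBlock(channels=32, freq_bins=80)(feats)
    print(out.shape)                      # torch.Size([2, 32, 80, 200])

Because the filter multiplies every spectral bin, each adaptive weight influences all frequency values after the inverse transform, which is how the branch attains a global frequency RF while the 3x3 convolutions in the parallel branch retain a local one.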