Articles
IEEE Transactions on Geoscience and Remote Sensing (15580644), 63
In this article, a multimodal deep architecture for the classification of light detection and ranging (LiDAR) data and hyperspectral images (HSI) is proposed, which acquires knowledge of both modalities by leveraging modality-specific and complementary information. The proposed model consists of two main steps. First, to improve the performance of a 2-D convolutional neural network (2DCNN), low-frequency maximum-autocorrelation-factor features of the HSI are injected into the 2DCNN; these are called the multiscale features of the 2DCNN. Second, to improve the accuracy of the 2DCNN and extract smooth, semantic information, the posterior energy of a hidden Markov random field (HMRF) is modified using Gaussian attention and albedo-recovery attention mechanisms together with the LiDAR and HMRF energies. These features are then fused through another attention mechanism, called attention-based HMRF, and this HMRF model also performs the fusion of HSI and LiDAR. The proposed model is tested on the Houston 2013, Trento, and MUUFL datasets and compared with several state-of-the-art methods. The resulting classification accuracies and the ablation study show the superior performance of the proposed method. © 1980-2012 IEEE.
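To make the fusion idea concrete, below is a minimal PyTorch sketch of attention-weighted fusion of two modality feature maps, in the spirit of the attention-based fusion described above. The module name and the gating design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention-weighted fusion of two modality feature
# maps (HSI and LiDAR); module and parameter names are illustrative, not
# the paper's implementation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learn per-channel, per-pixel attention weights to blend two modality features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, hsi_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # Attention map in [0, 1] decides the contribution of each modality.
        alpha = self.gate(torch.cat([hsi_feat, lidar_feat], dim=1))
        return alpha * hsi_feat + (1.0 - alpha) * lidar_feat

fused = AttentionFusion(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```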
Text summarization is a valuable method for extracting important details from large volumes of text data, facilitating tasks such as text data analysis. Various text summarization techniques have been developed over time, some focusing on selecting and summarizing short sentences while others overlook the semantic relationships between sentences. Extractive document summarization involves learning cross-sentence relations, a critical aspect that has been extensively explored using various approaches. One effective method is to employ graph-based neural networks, which offer an intricate structure capable of capturing relations among sentences. In this paper, we present a contextualized heterogeneous graph neural network for extractive text summarization (ConHGNN-SUM), which incorporates semantic nodes that extend beyond individual sentences and emphasizes capturing the relationships between selected sentences as a final step in the summarization process. These extra nodes function as intermediaries that connect sentences and enhance the interrelationships between them. Our model improves on conventional graph-based extractive methods and delivers performance comparable to other advanced extractive summarization systems. © 2024 IEEE.
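As a rough illustration of how extra semantic nodes can mediate between sentences, the following toy sketch performs one round of message passing over a sentence-word bipartite graph. The setup (mean aggregation, random initial embeddings) is an assumption for demonstration, not the ConHGNN-SUM architecture.

```python
# A minimal sketch of the sentence-word heterogeneous graph idea: extra
# word ("semantic") nodes mediate between sentence nodes. One round of
# mean-aggregation message passing; all names here are illustrative.
import torch

sentences = [["deep", "learning", "works"],
             ["graphs", "model", "relations"],
             ["deep", "graphs", "help"]]
vocab = sorted({w for s in sentences for w in s})
w2i = {w: i for i, w in enumerate(vocab)}

# Bipartite incidence matrix: A[i, j] = 1 if word j occurs in sentence i.
A = torch.zeros(len(sentences), len(vocab))
for i, s in enumerate(sentences):
    for w in s:
        A[i, w2i[w]] = 1.0

sent_h = torch.randn(len(sentences), 8)   # initial sentence embeddings
word_h = torch.randn(len(vocab), 8)       # initial word embeddings

# sentence -> word, then word -> sentence: sentences that share words
# now exchange information through the intermediate word nodes.
word_h = (A.t() @ sent_h) / A.sum(0, keepdim=True).t().clamp(min=1)
sent_h = (A @ word_h) / A.sum(1, keepdim=True).clamp(min=1)
print(sent_h.shape)  # torch.Size([3, 8])
```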
Engineering Applications of Artificial Intelligence (09521976), 138
Text-independent speaker verification is a challenging research field in biometric user authentication and a major application of artificial intelligence. It determines whether two content-unconstrained utterances come from the same or different speakers. Deep neural architectures typically extract temporal-frequency feature maps with a local receptive field (RF). Advanced architectures have also extracted features with a global RF, but only over the temporal dimension. To effectively leverage feature maps with a global RF over the frequency dimension, this work proposes a Deep Attentive Adaptive Filter (DAAF) module. It first applies a Fourier transform to the frequency dimension to yield features in a new spectral dimension; updating each spectral value in the Fourier domain affects all frequency values in the original domain. The feature maps are further boosted by introducing attention-based adaptive filtering in the Fourier domain, which adaptively modulates the spectral components with respect to the temporal and channel dimensions. By adopting residual convolutional networks, the DAAF module is then applied in parallel with a residual block to capture features with both global and local frequency RFs. The multi-RF features are finally combined through a novel attentive feature fusion module. Comprehensive experiments are conducted on two benchmark corpora under in-domain and out-of-domain scenarios. The superior performance of the proposed networks is verified with respect to two criteria, the equal error rate and the minimum of the detection cost function, while using a small number of learnable parameters. © 2024 Elsevier Ltd
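The following hedged PyTorch sketch illustrates the core spectral-filtering step as described: an FFT along the frequency axis, modulation of the spectral components by learnable complex weights, and an inverse FFT. The attention mechanism that conditions the filter on the temporal and channel dimensions is omitted here; the module name and shapes are illustrative assumptions.

```python
# Hedged sketch of the core DAAF idea: FFT along the frequency axis,
# learnable modulation of spectral components, inverse FFT. An illustrative
# reconstruction, not the authors' code; the attention conditioning is omitted.
import torch
import torch.nn as nn

class SpectralModulation(nn.Module):
    def __init__(self, channels: int, freq_bins: int):
        super().__init__()
        # One complex weight per (channel, spectral bin); rfft halves the axis.
        n_spec = freq_bins // 2 + 1
        self.weight = nn.Parameter(torch.ones(channels, n_spec, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        spec = torch.fft.rfft(x, dim=2)                       # to spectral domain
        spec = spec * self.weight.unsqueeze(0).unsqueeze(-1)  # global-RF filtering
        return torch.fft.irfft(spec, n=x.size(2), dim=2)      # back to frequency axis

y = SpectralModulation(channels=32, freq_bins=80)(torch.randn(4, 32, 80, 200))
print(y.shape)  # torch.Size([4, 32, 80, 200])
```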
In today's digital age, the comprehension and prediction of human personality traits have assumed paramount significance. This study addresses the task of predicting the Big Five personality traits from textual data, harnessing the capabilities of advanced natural language processing models. The focal dataset is ChaLearn First Impressions V2, a rich collection of human-generated text coupled with Big Five personality trait labels. A diverse array of models is examined, ranging from deep learning models such as the Deep Pyramid Convolutional Neural Network (DPCNN) and the Hierarchical Attention Network (HAN) to transformer-based architectures such as BERT and FLAN-T5. These models undergo evaluation across several training scenarios: fine-tuning all layers, freezing only the embedding layer, and freezing all layers (the last applied exclusively to Transformer models). Notably, models such as DPCNN and HAN achieve remarkable accuracy attributable to their hierarchical feature extraction. Conversely, Transformer models like ELECTRA excel when layers remain frozen, showcasing their exceptional contextual comprehension. Furthermore, the study employs word clouds to visually encapsulate the essence of each Big Five personality trait, revealing relationships between specific words and these traits. The findings underscore the interplay among model architecture, training methodology, and layer freezing, offering insights into strategies that yield optimal performance in predicting personality traits. In an age dominated by digital communication, this research contributes significantly to our understanding and prediction of human personalities. © 2024 IEEE.
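As an example of one of the training scenarios above, the snippet below freezes only the embedding layer of a Hugging Face BERT model. This is a standard pattern, not the study's exact script; the regression setup with five outputs for the Big Five scores is an assumption.

```python
# Sketch of the "freeze only the embedding layer" fine-tuning scenario with
# a Hugging Face BERT; a standard pattern, not the study's exact script.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=5,                 # assumed: one regression output per Big Five trait
    problem_type="regression",
)

# Keep embeddings fixed, fine-tune everything else.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```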
Knowledge-Based Systems (09507051), 301
Self-supervised learning (SSL) aims to create semantically enriched representations from unannotated data. A prevalent strategy in this field involves training a unified representation space that is invariant to various transformation combinations. However, creating a single representation invariant to multiple transformations poses several challenges: the efficacy of such a representation space depends on factors such as the intensity, sequence, and combination scenarios of the transformations. As a result, features generated in a single representation space may exhibit limited adaptability for subsequent tasks. In contrast to the conventional SSL training approach, we introduce a novel method that constructs multiple atomic transformation-invariant representation subspaces. Each subspace in the proposed method is invariant to a specific atomic transformation from a predefined reference set. Our method offers increased flexibility by enabling the downstream task to weigh every atomic transformation-invariant subspace according to the desired feature space. A series of experiments was conducted to compare our approach to traditional self-supervised learning methods and assess its effectiveness. This evaluation encompassed diverse data regimes, datasets, evaluation protocols, and perspectives on source-destination data distribution. Our results highlight the superiority of our method over training strategies based on a single transformation-invariant representation space. Additionally, the proposed method demonstrated superior performance in reducing false positives in pulmonary nodule detection compared to several recent supervised and self-supervised approaches. © 2024 Elsevier B.V.
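A minimal sketch of one training step for the multi-subspace idea is given below: a shared backbone feeds one projection head per atomic transformation, and each head is trained to agree between an image and its transformed version. The transformations, losses, and names are assumptions, and collapse-prevention terms used in practice are omitted for brevity.

```python
# Illustrative sketch: one projection head ("subspace") per atomic
# transformation, each trained for invariance only to its own transformation.
# Names and losses are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
transforms = {
    "flip":  lambda x: torch.flip(x, dims=[-1]),
    "noise": lambda x: x + 0.1 * torch.randn_like(x),
}
heads = nn.ModuleDict({name: nn.Linear(128, 64) for name in transforms})

x = torch.randn(16, 3, 32, 32)
loss = 0.0
for name, t in transforms.items():
    z1 = F.normalize(heads[name](backbone(x)), dim=1)
    z2 = F.normalize(heads[name](backbone(t(x))), dim=1)
    # Invariance term: an image and its transform should agree inside the
    # subspace dedicated to that atomic transformation.
    loss = loss + (1 - (z1 * z2).sum(dim=1)).mean()
print(float(loss))
```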
Expert Systems with Applications (09574174), 222
As an attractive research problem in biometric authentication, Text-Independent Speaker Verification (TI-SV) aims to determine whether two given unconstrained utterances come from the same speaker. As state-of-the-art solutions, end-to-end approaches using deep neural networks seek to learn a highly discriminative speaker embedding space. In this paper, we propose a novel end-to-end approach for speaker embedding learning that focuses on two crucial factors: the speaker embedder architecture and the objective function. The proposed module in the speaker embedder is composed of an Efficient Multi-resolution feature Representation (EMR) block followed by a Multi-scale Channel Attention Fusion (MCAF) block. The EMR effectively addresses the issue of fixed-resolution convolutional kernels, which are commonly used in most embedder architectures, while the MCAF significantly improves on the simple summation-based feature fusion used in residual embedder networks. Regarding the objective function, we guide the speaker embedding space to learn embedding-to-embedding relations, in addition to the embedding-to-training-class relations employed by most previous methods. To this end, we propose employing a dynamic graph attention network on top of the proposed embedder to learn all informative relations between embeddings, training both the embedder and the graph-based network in an end-to-end manner. We conduct various experiments on the large-scale benchmark dataset VoxCeleb1&2. The effectiveness of all proposed components is verified through an ablation study. We show the superior or competitive performance of the proposed approach compared to seven well-known embedding architectures and 32 SV systems, in terms of two evaluation metrics, EER and minDCF, as well as the number of embedder parameters. © 2023 Elsevier Ltd
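To illustrate the attention-based alternative to summation fusion in a residual block, here is a hedged sketch of a per-channel gated fusion; the actual MCAF block is multi-scale and more elaborate, so the module below should be read as an assumption-laden toy.

```python
# Rough sketch in the spirit of MCAF: instead of summing a residual branch
# and the identity, a learned per-channel gate mixes them. Purely
# illustrative; the paper's block is multi-scale and more elaborate.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, identity, residual):
        alpha = self.gate(identity + residual)   # per-channel weights in [0, 1]
        return alpha * identity + (1 - alpha) * residual

out = ChannelAttentionFusion(64)(torch.randn(2, 64, 20, 50), torch.randn(2, 64, 20, 50))
print(out.shape)  # torch.Size([2, 64, 20, 50])
```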
Lately, deep learning has become increasingly popular for solving problems across multiple domains, including medical image analysis. This research introduces a process based on deep convolutional neural networks to diagnose Alzheimer's disease (AD) and its stages from magnetic resonance imaging (MRI) scans. Identifying AD in elderly individuals is difficult because the disease damages brain cells related to memory and cognition, and its patterns are hard to distinguish from normal brain patterns in scans; detection therefore requires discriminative feature representations, which deep learning methods can acquire from the MRI data. In this paper, five different transfer learning models are trained as 15 binary classifiers, each distinguishing two of the AD, Mild Cognitive Impairment (MCI), and Cognitively Normal (CN) classes. This method finds the best transfer learning model for each binary comparison. The proposed technique achieves a best accuracy of 92% for the AD vs. CN classifier, 94% for the AD vs. MCI classifier, and 72% for the MCI vs. CN classifier, which shows the effectiveness of transfer learning in distinguishing the AD vs. CN and AD vs. MCI cases. © 2023 IEEE.
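A minimal sketch of the pairwise scheme might look as follows: one pretrained backbone per binary comparison, each with a fresh two-way head. The choice of ResNet-18 as the backbone is an illustrative assumption; the paper evaluates five different transfer learning models.

```python
# Minimal sketch of the pairwise scheme: one pretrained backbone per binary
# comparison (AD vs CN, AD vs MCI, MCI vs CN), each with a fresh 2-way head.
# The backbone choice and names are illustrative assumptions.
import torch.nn as nn
from torchvision import models

pairs = [("AD", "CN"), ("AD", "MCI"), ("MCI", "CN")]
classifiers = {}
for a, b in pairs:
    net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    net.fc = nn.Linear(net.fc.in_features, 2)  # binary head for this pair
    classifiers[f"{a}_vs_{b}"] = net

print(list(classifiers))  # ['AD_vs_CN', 'AD_vs_MCI', 'MCI_vs_CN']
```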
This paper presents the system developed by the Sartipi-Sedighin team for SemEval 2023 Task 2, a shared task focused on multilingual complex named entity recognition (NER), or MultiCoNER II. The goal of this task is to identify and classify complex named entities (NEs) in text across multiple languages. To tackle the MultiCoNER II task, we leveraged pre-trained language models (PLMs) fine-tuned for each language included in the dataset. In addition, we applied a data augmentation technique to increase the amount of training data available to our models. Specifically, we searched Wikipedia for relevant NEs that already existed in the training data and added new instances of these entities to our training corpus. Our team achieved an overall F1 score of 61.25% in the English track and 71.79% in the multilingual track, among the 13 tracks of the shared task to which we submitted. © 2023 Association for Computational Linguistics.
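The described augmentation can be sketched as substituting an entity mention with another entity of the same type drawn from a Wikipedia-derived gazetteer. The toy gazetteer and function below are hypothetical stand-ins, not the team's pipeline.

```python
# Hedged sketch of the augmentation idea: swap a named entity for another
# entity of the same type from a Wikipedia-derived gazetteer. The gazetteer
# here is a toy stand-in; names are illustrative.
import random

gazetteer = {"LOC": ["Paris", "Tehran", "Oslo"], "PER": ["Ada Lovelace", "Alan Turing"]}

def augment(tokens, spans):
    """spans: list of (start, end, type) entity offsets over tokens."""
    tokens = list(tokens)
    for start, end, etype in reversed(spans):  # right-to-left keeps offsets valid
        replacement = random.choice(gazetteer[etype]).split()
        tokens[start:end] = replacement
    return tokens

print(augment(["She", "visited", "Berlin", "."], [(2, 3, "LOC")]))
```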
World Wide Web (15731413), 26(5), pp. 3027-3054
The standard Euclidean distance assigns equal contributions to all features of each pair of data samples when computing the similarity matrix, whereas different features of real-world datasets have different importance. This paper proposes a new clustering method based on reinforcement learning and soft feature selection with three innovative ideas. First, a novel distance metric based on the importance of features is introduced, which can additionally suppress irrelevant features almost entirely. Second, a new soft weighting mechanism is defined based on this distance to determine the effect of the neighborhood probability in the similarity matrix. Since the training data contain noisy and redundant features, a sparsity regularization term is applied to address this problem and emphasize feature selection. Third, after these dimensionality reduction steps, a new clustering method is developed based on reinforcement learning, which treats the obtained low-dimensional data points as the states of the learning agents. It applies different actions until convergence, transferring the most scattered points from one cluster to another to produce coherent clusters and balance them. The proposed method is able to achieve high within-cluster consistency. The experimental results on several real-world datasets show the good performance and efficiency of the proposed method. Statistical analysis, parameter sensitivity analysis, and time complexity analysis all confirm the soundness of the results obtained. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
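A small sketch of the first idea, a feature-weighted distance with a sparsity penalty that drives the weights of irrelevant features toward zero, is given below; the weight values and penalty coefficient are illustrative assumptions, not the paper's learned solution.

```python
# Sketch of the feature-weighted distance idea: a nonnegative weight per
# feature scales its contribution, and an L1 term pushes weights of
# irrelevant features toward zero. Illustrative, not the paper's solver.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = np.array([1.0, 0.8, 0.0, 0.0, 0.3])  # sparse weights: features 3-4 dropped

def weighted_dist(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

l1_penalty = 0.1 * np.sum(np.abs(w))  # sparsity regularization term
print(weighted_dist(X[0], X[1], w), l1_penalty)
```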
As a state-of-the-art solution for speaker verification problems, deep neural networks have been successfully employed to extract speaker embeddings that represent speaker-informative features. Objective functions, as the supervisors for learning discriminative embeddings, play a crucial role in this task. In this paper, motivated by the success of metric learning approaches, we investigate four newly proposed metrics from the literature, specifically for the speaker verification problem. For deeper comparison, we consider metrics from both main groups of metric-based objectives, i.e., instance-based and proxy-based ones. Considering embeddings as instances, the first group exploits instance-to-instance relations, while the latter associates instances with proxies that represent the training samples. Evaluations in terms of Equal Error Rate (EER) are conducted in two conventional manners, end-to-end and modular, where cosine similarity and PLDA are applied to the embeddings, respectively. Experimental results show that in the end-to-end case, instance-based metrics outperform proxy-based ones, while interestingly the opposite behavior is observed in the modular case. Finally, the lowest EER is achieved by adopting one of the proxy-based metrics, namely SoftTriple, in the modular manner. It yields relative improvements of up to 12% compared to the state-of-the-art method, i.e., the x-vector. © 2020 IEEE.
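For reference, the modular scoring setup can be sketched as cosine similarity between embeddings plus an EER computed from target and non-target scores, as below. This is standard practice rather than the paper's exact pipeline, and the score distributions are synthetic.

```python
# Small sketch of speaker-verification scoring: cosine similarity between
# two embeddings, and EER computed from target/non-target scores.
# Standard practice, not the paper's exact pipeline.
import numpy as np

def cosine_score(e1, e2):
    return np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

def eer(target_scores, nontarget_scores):
    # Sweep thresholds; EER is where false-accept and false-reject rates meet.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(1)
print(eer(rng.normal(1.0, 0.5, 1000), rng.normal(0.0, 0.5, 1000)))
```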
The quality and intelligibility of speech signals are degraded under additive background noise, which is a critical problem for hearing aid and cochlear implant users. Motivated to address this problem, we propose a novel speech enhancement approach using a deep spectrum image translation network. To this end, we suggest a new architecture, called VGG19-UNet, where a deep fully convolutional network known as VGG19 is embedded in the encoder part of an image-to-image translation network, i.e., U-Net. Moreover, we propose a perceptually modified version of the spectrum image represented in the Mel frequency and power-law non-linearity amplitude domains, which are good approximations of the human auditory perception model. By conducting experiments on a real challenge in speech enhancement, i.e., unseen noise environments, we show that the proposed approach outperforms other enhancement methods in terms of both quality and intelligibility measures, represented by PESQ and ESTOI, respectively. © 2019 IEEE.
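A brief sketch of the perceptually modified spectrum described above: a Mel-scaled spectrogram followed by power-law amplitude compression. The exponent 0.3 is a common perceptual choice assumed here; the paper's exact value may differ.

```python
# Sketch of the perceptually-modified spectrum image: a Mel-scaled
# spectrogram with power-law amplitude compression (exponent 0.3 assumed).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))           # example clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
perceptual = np.power(mel, 0.3)                       # power-law non-linearity
print(perceptual.shape)                               # (64, frames)
```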
Computer Speech and Language (08852308), 50, pp. 105-125
From the perspectives of both speech production and speech perception, vowels, as syllable nuclei, can be considered the most significant speech events. Detection of vowel events from a speech signal is usually performed by a two-step procedure. First, a temporal objective contour (TOC), a time-varying measure of vowel similarity, is generated from the speech signal. Second, vowel landmarks, the places of vowel events, are extracted by locating prominent peaks of the TOC. In this paper, by employing several spectral models in a sequential manner, we propose a new framework that directly addresses three possible errors in the vowel detection problem, namely vowel deletion, consonant insertion, and vowel insertion. The proposed framework consists of three main steps. In the first step, two solutions are proposed to substantially reduce the initial vowel deletion error. The first solution is to use the peaks detected by a conventional energy-based TOC, but without the TOC smoothing and peak thresholding processes. The peaks detected by a spectral-based TOC generated from GMM models are put forward as the second solution for achieving a smaller vowel deletion error. In the second step, a two-class support vector machine (SVM) classifier is adopted to distinguish consonant peaks from vowel peaks; removing the peaks classified as consonants reduces the consonant insertion error. Finally, a two-class SVM classifier is proposed to identify consecutive peaks detected within the same vowel; merging the peaks classified as "same vowel" considerably reduces the vowel insertion error. Experiments are conducted separately on three standard speech corpora, namely FARSDAT, TIMIT, and TFARSDAT. The effectiveness of the techniques proposed to reduce the three types of detection errors is verified. The total error (the sum of the three detection errors) and F-measure are about 9.7% and 95.1% for FARSDAT, 17.5% and 91.3% for TIMIT, and 19.6% and 90.2% for the TFARSDAT corpus, respectively. The evaluation results show that the proposed framework outperforms existing well-known methods in terms of both total error and F-measure on both read and spontaneous speech corpora. © 2018 Elsevier Ltd
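The first solution, an energy-based TOC with peak picking and no thresholding, can be sketched as follows; the frame sizes, the synthetic signal, and the peak-distance constraint are assumptions for demonstration.

```python
# Illustrative sketch of the two-step procedure: a short-time energy contour
# as the TOC, and scipy peak picking (no thresholding, so candidate peaks
# are kept) as the landmark locator. Parameters are assumptions.
import numpy as np
from scipy.signal import find_peaks

def energy_toc(signal, frame_len=400, hop=160):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

rng = np.random.default_rng(2)
x = rng.normal(size=16000) * np.sin(np.linspace(0, 20 * np.pi, 16000)) ** 2
toc = energy_toc(x)
peaks, _ = find_peaks(toc, distance=5)  # candidate vowel landmarks
print(len(peaks))
```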
Speech Communication (01676393), 91, pp. 28-48
Vowel detection methods usually adopt a two-stage procedure for detecting vowel landmarks. First, a temporal objective contour (TOC), a time-varying measure of vowel-likeness, is generated from the speech signal. Then, vowel landmarks are extracted by locating outstanding peaks of the TOC. Focusing on the TOC generation stage, this paper presents a new model based on proposed components called matched filters (MFs). Extraction of the MFs and design of the MF-based model constitute our two main contributions. Motivated by the human auditory system, the MFs are extracted by applying a series of perceptually-based processing operations to the speech spectra of the voiced frames. Accordingly, any factor leading to variation of the speech spectra will also change the extracted MFs, so it is necessary to condition the filters on the factors affecting their characteristics. Based on this fact, the proposed MF-based model is designed in the following two steps. First, an acoustic space representing two effective factors, namely phonetic context and speaker identity, is modeled. Then, vowel and consonant MFs are conditioned on this context-speaker acoustic space. Indeed, instead of using a fixed filter bank for the entire speech signal (a popular TOC generation technique), the proposed TOC is generated by adopting a pair of vowel and consonant MFs for each voiced speech frame. Experiments are conducted separately on two standard continuous speech corpora, a Persian corpus (FARSDAT) and an English one (TIMIT). Across various experiments, we find that all the characteristics employed in the proposed model decrease the total error measure to varying degrees. Using the proposed method, total error values of 14.2% and 18.9% are obtained in clean conditions for FARSDAT and TIMIT, respectively. Moreover, the effectiveness of the proposed algorithm is verified in additive noise conditions with different signal-to-noise ratios. According to the evaluation results, the proposed method shows desirable performance in terms of total error in comparison with existing well-known methods on both corpora, in both clean and noisy conditions. © 2017 Elsevier B.V.
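As a toy illustration of the matched-filter idea, the sketch below scores each frame spectrum against a vowel template and a consonant template and takes the difference as the TOC value. The random templates stand in for the paper's MFs, which are conditioned on a context-speaker acoustic space.

```python
# Toy sketch of the matched-filter TOC: score each frame's spectrum against
# a vowel template and a consonant template; the difference is the TOC value.
# Random templates are stand-ins for the paper's conditioned MFs.
import numpy as np

rng = np.random.default_rng(3)
frames = rng.random((100, 40))            # 100 frames x 40 spectral bins
vowel_mf, cons_mf = rng.random(40), rng.random(40)

def score(frame, mf):
    return np.dot(frame, mf) / (np.linalg.norm(frame) * np.linalg.norm(mf))

toc = np.array([score(f, vowel_mf) - score(f, cons_mf) for f in frames])
print(toc.shape)  # (100,)
```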
International Journal of Pattern Recognition and Artificial Intelligence (02180014), 25(1), pp. 1-35
One problem in background estimation is inherent change in the background, such as waving tree branches, water surfaces, camera shake, and the presence of moving objects in every image. In this paper, a new method for background estimation is proposed based on function approximation in the kernel domain. For this purpose, a Weighted Kernel-based Learning Algorithm (WKLA) is designed. The WKLA is a weighted variant of the kernel least mean square algorithm capable of function approximation in the presence of noise. The proposed background estimation method thus comprises two stages: first, a novel outlier detection algorithm, the Fuzzy Outlier Detector (FOD), is applied; the obtained results are then fed to the WKLA. The proposed approach can handle scenes containing moving backgrounds, gradual illumination changes, camera vibrations, and non-empty backgrounds. Qualitative results and quantitative evaluations on various indoor and outdoor sequences, relative to existing approaches, show the high accuracy and effectiveness of the proposed method in background estimation and foreground detection. © 2011 World Scientific Publishing Company.
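The core recursion of a kernel least mean square (KLMS) learner, the basis of the WKLA, can be sketched as below; the weighting scheme and the fuzzy outlier detector are omitted, so this is only a minimal, assumption-laden illustration.

```python
# Minimal kernel least mean square (KLMS) sketch for function approximation;
# the paper's weighted variant (WKLA) and the fuzzy outlier detector are
# omitted, so this shows only the core recursion.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def klms(inputs, targets, lr=0.5):
    centers, coeffs = [], []
    for x, d in zip(inputs, targets):
        y = sum(a * gaussian_kernel(c, x) for c, a in zip(centers, coeffs))
        centers.append(x)            # grow the dictionary with each sample
        coeffs.append(lr * (d - y))  # LMS update in the kernel feature space
    return centers, coeffs

t = np.linspace(0, 1, 50).reshape(-1, 1)
centers, coeffs = klms(t, np.sin(2 * np.pi * t).ravel())
print(len(centers))  # 50
```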