A Deep Learning Model for Question Analysis in Low-Resource Languages: A Dataset and Case Study for Persian
Abstract
Question-answering systems rest on three fundamental functions: question classification, information retrieval, and answer selection. Each must be refined for the system to retrieve exact answers with precision. Question classification, a cornerstone task, anticipates the type of answer expected for a posed query. However, the performance of question classification algorithms is hampered, particularly in agglutinative languages with complex morphology such as Persian, for which linguistic resources are limited. In this study, we propose LACNN, a novel multi-layer Long Short-Term Memory (LSTM) Attention Convolutional Neural Network (CNN) classifier tailored to extract pertinent information from Persian-language contexts. Notably, the model requires no prior knowledge or external features. Moreover, we introduce UIMQC, the first medical question dataset in Persian, derived from the English GARD dataset. The questions in UIMQC are inherently intricate, often concerning rare diseases that require specialized diagnosis. Our experiments show a notable improvement over baseline methods: a 9% performance increase on the UTQC dataset and 67.08% accuracy on the UIMQC dataset. Consequently, we advocate adopting the LACNN model for morphological analysis tasks across low-resource languages, since in question-answering systems it improves the retrieval of accurate answers to users’ queries. ©2024 IEEE.