Peer-to-Peer Networking and Applications (19366450)18(3)
Trustworthy manufacturer selection is essential for ensuring production quality control. Existing methodologies for selecting manufacturers often rely on participant sentiments or opinions, operating within a non-transparent framework. This paper proposes a comprehensive approach to trustworthy manufacturer selection by integrating a peer-to-peer system with a machine learning model. Specifically, it leverages blockchain technology to create a decentralized infrastructure that enhances transparency in the selection process. Furthermore, an intelligent model is introduced, incorporating a manufacturer evaluation module and the Bidirectional Encoder Representations from Transformers (BERT) language model to classify participant sentiments and opinions. To improve the efficiency of sentiment classification, under-sampling techniques are used to balance the datasets. The proposed method is applied within the pharmaceutical industry, where the increasing number of drug manufacturers has heightened the need for reliable manufacturer selection processes. The approach achieved a sentiment classification accuracy of 91% and a macro-averaged F1-score of 0.89. Comprehensive evaluations confirm the effectiveness of the proposed integrated model, offering a robust solution for trustworthy manufacturer selection. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
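The class-balancing step mentioned above can be sketched as follows; the paper's actual pipeline is not reproduced here, so the function name `undersample` and the toy review data are illustrative only. This is plain random under-sampling: every class is reduced to the size of the smallest class.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, seed=0):
    """Randomly under-sample majority classes down to the minority class size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    n_min = min(len(v) for v in by_label.values())
    balanced = []
    for y, xs in sorted(by_label.items()):
        for x in rng.sample(xs, n_min):  # keep n_min random samples per class
            balanced.append((x, y))
    return balanced

# Toy imbalanced sentiment data: 8 positive reviews vs. 2 negative ones.
reviews = ["good"] * 8 + ["bad"] * 2
labels = ["pos"] * 8 + ["neg"] * 2
balanced = undersample(reviews, labels)
print(Counter(y for _, y in balanced))  # both classes reduced to 2 samples
```

In a real pipeline the balanced pairs would then be tokenized and fed to the BERT classifier; more elaborate schemes (cluster-based or informed under-sampling) follow the same interface.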
Soft Computing (14327643)29(2)pp. 1243-1258
Demand forecasting has emerged as a crucial element in supply chain management. It is essential to identify anomalous data and continuously improve the forecasting model with new data. However, the existing literature fails to comprehensively cover both aspects of anomaly detection and continuous improvement in demand forecasting. This study proposes an enhanced model to improve accuracy in demand forecasting. The proposed model introduces a novel data handling method that incorporates an anomaly detection autoencoder, improved with anomaly correction mechanisms. The data handling approach simultaneously detects data anomalies, distinguishes between expected and unexpected anomalies, and corrects anomalous data, ensuring cleaner input for demand forecasting. The proposed model then employs a long short-term memory architecture for demand forecasting, enhanced with a continuous improvement method. Thus, the model not only forecasts demand but also retrains itself when the amount of anomalous data surpasses a predetermined threshold, thereby improving forecasting accuracy. The results show that the proposed model outperforms other models in detecting data anomalies, achieving an average precision-recall of 0.922, a receiver operating characteristic value of 0.739, and a significance level of less than 0.05. Finally, the model exhibits superior performance in demand forecasting, with average mean squared error, root mean squared error, and mean absolute error values of 33.167, 4.347, and 1.509, respectively, all with a significance level of less than 0.05. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
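The detect-then-correct idea above can be sketched in a few lines. This is not the paper's autoencoder: a local moving average stands in for the learned reconstruction, and the z-score threshold is an illustrative choice. The logic is the same, though: flag points whose reconstruction residual is an outlier, and replace them with the reconstruction before forecasting.

```python
import statistics

def detect_and_correct(series, window=3, z_threshold=2.5):
    """Flag points whose deviation from a local-average 'reconstruction' is a
    z-score outlier, and replace them with the reconstruction.
    (The moving average stands in here for the paper's autoencoder output.)"""
    n = len(series)
    recon = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neigh = [series[j] for j in range(lo, hi) if j != i]
        recon.append(sum(neigh) / len(neigh))
    resid = [x - r for x, r in zip(series, recon)]
    mu = statistics.fmean(resid)
    sd = statistics.pstdev(resid) or 1e-9
    anomalies = [i for i, r in enumerate(resid) if abs(r - mu) / sd > z_threshold]
    flagged = set(anomalies)
    corrected = [recon[i] if i in flagged else x for i, x in enumerate(series)]
    return corrected, anomalies

# Toy daily demand with a single spike at index 4.
demand = [10, 11, 9, 10, 80, 11, 10, 9, 10, 11]
corrected, anomalies = detect_and_correct(demand)
print(anomalies)  # [4]
```

The cleaned `corrected` series would then feed the LSTM forecaster; in the paper's continuous-improvement loop, retraining would be triggered when `len(anomalies)` exceeds a preset threshold.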
Jahani, M.,
Zojaji, Z.,
Montazerolghaem, A.,
Palhang, M.,
Ramezani, R.,
Golkarnoor, A.,
Safaei, A.A.,
Bahak, H.,
Saboori, P.,
Halaj, B.S. Journal Of Medical Signals And Sensors (22287477)15(1)
Background: The pharmaceutical industry has seen increased drug production by different manufacturers. Failure to recognize future needs has caused improper production and distribution of drugs throughout the supply chain of this industry. Demand forecasting is one of the basic requirements to overcome these challenges; it helps ensure that drug quantities are accurately estimated and produced at the right time. Methods: Artificial intelligence (AI) technologies are suitable methods for forecasting demand. The more accurate this forecast is, the better the decisions that can be made about managing drug production and distribution. The Isfahan AI Competitions 2023 organized a challenge to provide models for accurately predicting drug demand. In this article, we introduce this challenge and describe the proposed approaches that led to the most successful results. Results: A dataset of drug sales was collected from 12 pharmacies of Hamadan University of Medical Sciences. This dataset contains 8 features, including sales amount and date of purchase. Competitors competed on this dataset to accurately forecast the volume of demand. The purpose of this challenge was to provide a model with a minimum error rate while addressing some qualitative scientific metrics. Conclusions: In this competition, methods based on AI were investigated. The results showed that machine learning methods are particularly useful in drug demand forecasting. Furthermore, extending the data features by adding geographic features helps increase the accuracy of the models. © 2025 Journal of Medical Signals & Sensors.
Davanian, F.,
Adibi, I.,
Tajmirriahi, M.,
Monemian, M.,
Zojaji, Z.,
Montazerolghaem, A.,
Asadinia, M.A.,
Mirghaderi, S.M.,
Esfahani, S.A.N.,
Kazemi, M. Journal Of Medical Signals And Sensors (22287477)15(2)
Background: Multiple sclerosis (MS) is one of the most common causes of neurological disability in young adults. The disease occurs when the immune system attacks the central nervous system and destroys the myelin of nerve cells, resulting in the appearance of multiple lesions in the magnetic resonance (MR) images of patients. Accurate determination of the number and location of lesions can help physicians assess the severity and progression of the disease. Method: Given the importance of this issue, this challenge was dedicated to the segmentation and localization of lesions in MR images of patients with MS. The goal was to segment and localize the lesions in the FLAIR MR images of patients as close as possible to the ground-truth masks. Results: Several teams sent us their results for the segmentation and localization of lesions in MR images. Most of the teams preferred to use deep learning methods, which varied from a simple U-Net structure to more complicated networks. Conclusion: The results show that deep learning methods can be useful for the segmentation and localization of lesions in MR images. In this study, we briefly describe the dataset and the methods of the teams attending the competition. © 2025 Journal of Medical Signals & Sensors.
Sedighin, F.,
Monemian, M.,
Zojaji, Z.,
Montazerolghaem, A.,
Asadinia, M.A.,
Mirghaderi, S.M.,
Esfahani, S.A.N.,
Kazemi, M.,
Mokhtari, R.,
Mohammadi, M. Journal Of Medical Signals And Sensors (22287477)15(1)
Background: Computer-aided diagnosis (CAD) methods have become of great interest for diagnosing macular diseases over the past few decades. Artificial intelligence (AI)-based CADs offer several benefits, including speed, objectivity, and thoroughness. They are utilized as assistance systems in various ways, such as highlighting relevant disease indicators to doctors, providing diagnosis suggestions, and presenting similar past cases for comparison. Methods: More specifically, retinal AI-CADs have been developed to assist ophthalmologists in analyzing optical coherence tomography (OCT) images, making retinal diagnostics simpler and more accurate than before. Retinal AI-CAD technology could provide new insight into the care of patients who do not have access to a specialist. AI-based classification methods are critical tools in developing improved retinal AI-CAD technology. The Isfahan AI-2023 challenge organized a competition to provide objective formal evaluations of alternative tools in this area. In this study, we describe the challenge and the methods that had the most successful algorithms. Results: A dataset of OCT images, acquired from normal subjects, patients with diabetic macular edema, and patients with other macular disorders, was provided in a documented format. The dataset, including the labeled training set and unlabeled test set, was made accessible to the participants. The aim of this challenge was to maximize the performance measures for the test labels. Researchers tested their algorithms and competed for the best classification results. Conclusions: The competition was organized to evaluate current AI-based classification methods in macular pathology detection. We received several submissions to our posted datasets, which indicates the growing interest in AI-CAD technology. The results demonstrated that deep learning-based methods can learn essential features of pathologic images, but much care has to be taken in choosing and adapting appropriate models for small, imbalanced datasets. © 2025 Journal of Medical Signals & Sensors.
Kenari, A.R.,
Montazerolghaem, A.,
Zojaji, Z.,
Ghatee, M.,
Yousefimehr, B.,
Rahmani, A.,
Kalani, M.,
Kiyanpour, F.,
Kiani-abari, M.,
Fakhar, M.Y. Journal Of Medical Signals And Sensors (22287477)15(2)
Background: Gastroesophageal reflux disease (GERD) is a prevalent digestive disorder that impacts millions of individuals globally. Multichannel intraluminal impedance-pH (MII-pH) monitoring is a novel technique and currently stands as the gold standard for diagnosing GERD. Accurately characterizing reflux events from MII data is crucial for GERD diagnosis. Although the clinical literature first pointed toward software advancements several years ago, the reliable extraction of reflux events from MII data continues to pose a significant challenge. Success requires the seamless collaboration of two key components: a reflux definition criteria protocol established by gastrointestinal experts and a comprehensive analysis of MII data for reflux detection. Method: To address this challenge, our team assembled a dataset comprising 201 MII episodes. We meticulously crafted precise reflux episode definition criteria, establishing the gold standard and labels for the MII data. Result: A variety of signal-analysis methods should be explored. The first Isfahan Artificial Intelligence Competition in 2023 featured formal assessments of alternative methodologies across six distinct domains, including MII data evaluations. Discussion: This article outlines the datasets provided to participants and offers an overview of the competition results. © 2025 Journal of Medical Signals & Sensors.
Cluster Computing (13867857)28(1)
Poor-quality products and counterfeit tags pose major challenges within supply chains. Although several approaches have been suggested, none addresses all of these challenges. This paper proposes a blockchain-integrated supply chain to tackle the issues of poor-quality products and counterfeit tags. The requirements for a secure supply chain system are identified, and several algorithms are proposed. A reliable manufacturer selection algorithm improves product quality using participant opinions. A smart traceability algorithm monitors the product path using sensors, preventing the distribution of poor-quality products in the supply chain. Additionally, two counterfeit tag detection algorithms employ a weighted graph to simulate the shortest paths for product distribution, identifying cloning, application, and modification attacks on product tags. Detailed security, scalability, performance, and comparative analyses for the drug supply chain show that the proposed system successfully improves product quality and accurately detects counterfeit tags without a significant drop in performance. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
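The shortest-path idea behind the counterfeit tag detection can be illustrated with a simplified sketch. The node names, transit times, and the "scanned sooner than physically possible" rule below are hypothetical; the paper's actual algorithms and attack models are not reproduced here.

```python
import heapq

def shortest_times(graph, src):
    """Dijkstra over a weighted distribution graph: graph[u] = {v: transit_time}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_clone_suspect(graph, last_scan, current_scan):
    """A tag seen at `current_scan` sooner than the fastest possible transit
    from its `last_scan` location suggests a cloned tag."""
    (u, t0), (v, t1) = last_scan, current_scan
    fastest = shortest_times(graph, u).get(v, float("inf"))
    return t1 - t0 < fastest

# Hypothetical drug distribution network; edge weights are transit hours.
net = {"factory": {"wholesaler": 5}, "wholesaler": {"pharmacy": 3},
       "factory2": {"pharmacy": 2}}
# A tag scanned at the factory at hour 0 and at a pharmacy at hour 4 is
# suspect: the fastest factory-to-pharmacy path takes 8 hours.
print(is_clone_suspect(net, ("factory", 0), ("pharmacy", 4)))  # True
```

The same graph primitive supports checks for tags appearing at nodes off any valid distribution path, which is the flavor of the application and modification attack checks.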
Multimedia Tools and Applications (13807501)83(35)pp. 83275-83309
The atmosphere is one of the game elements that can significantly influence players' emotions. However, creating an immersive atmosphere that effectively influences player emotions poses several challenges, necessitating the utilization of various elements, such as audio-visual coordination and gameplay design. This paper introduces a general framework for procedurally generating dungeons with joyful and horror atmospheres in games, providing an abstract perspective to address these challenges. The proposed framework introduces a categorization system for game elements based on their role within the game. Leveraging this categorization, the Comprehensive Arrangement of Game Elements (CAGE) pattern is introduced, which facilitates the appropriate placement of elements within the dungeon environment. Subsequently, the General Framework for Generating Dungeons with Atmosphere (GFGDA) is employed to procedurally create the dungeon using the Feasible–Infeasible Two-Population (FI-2Pop) algorithm. To enhance the gameplay experience, similar elements in the dungeon environment that impact gameplay are grouped, and their coordination is evaluated by creating a graph based on the CAGE pattern. The transition and coordination of audio-visual elements along the path between these impactful elements are assessed in order to generate an immersive atmosphere within the dungeon. To assess diversity, we examined the dungeons generated over 100 runs and found that our method consistently produces distinct results in each iteration. Moreover, two comparative studies were conducted, one with 51 volunteers and another with 10 volunteers. In the first study, the Game Experience Questionnaire (GEQ) was utilized to assess the emotional impact of dungeons generated by our method, compared to dungeons created using a uniform random approach and to a method from related research.
The results suggest that our method significantly influences player emotions across four components of the GEQ (sensory and imaginary immersion, flow, negative affect, and challenge) when compared to dungeons generated by the uniform random approach and by the method from related research. In the second study, the emotional impact of two dungeons, one generated with joyful elements and the other with eerie elements, was evaluated using the GEQ. The findings indicate significant differences in two components of the GEQ (tension and positive affect) when players interacted with the level containing joyful elements compared to the one with eerie elements. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
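The FI-2Pop algorithm named above maintains two populations: feasible individuals evolve toward higher fitness, infeasible ones evolve toward lower constraint violation, and offspring migrate between the pools as their feasibility changes. A minimal sketch follows, assuming a toy "dungeon" encoded as a list of room sizes; the paper's actual encoding, constraints, and fitness function are not shown here.

```python
import random

def fi2pop(init, fitness, violation, mutate, generations=50, pop_size=20, seed=1):
    """Minimal Feasible-Infeasible Two-Population (FI-2Pop) loop."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        feas = [x for x in pop if violation(x) == 0]
        infeas = [x for x in pop if violation(x) > 0]
        children = []
        for _ in range(pop_size):
            if feas and (not infeas or rng.random() < 0.5):
                # tournament on the feasible pool: maximize fitness
                parent = max(rng.sample(feas, min(2, len(feas))), key=fitness)
            else:
                # tournament on the infeasible pool: minimize violation
                parent = min(rng.sample(infeas, min(2, len(infeas))), key=violation)
            children.append(mutate(parent, rng))
        pop = children  # feasibility is re-checked next generation (migration)
    feas = [x for x in pop if violation(x) == 0]
    return max(feas, key=fitness) if feas else min(pop, key=violation)

# Toy stand-in for dungeon generation: 8 rooms with sizes 1-5;
# constraint: total size <= 20; fitness: variety of room sizes.
init = lambda rng: [rng.randint(1, 5) for _ in range(8)]
violation = lambda d: max(0, sum(d) - 20)
fitness = lambda d: len(set(d))
def mutate(d, rng):
    d = list(d)
    d[rng.randrange(len(d))] = rng.randint(1, 5)
    return d

best = fi2pop(init, fitness, violation, mutate)
```

Keeping infeasible individuals alive (rather than discarding them) lets the search cross infeasible regions of the space, which is why FI-2Pop suits heavily constrained layouts such as dungeons.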
Hosseini, H.,
Zare, M.S.,
Mohammadi, A.H.,
Kazemi, A.,
Zojaji, Z.,
Nematbakhsh, M.A. pp. 272-278
Retrieval-augmented generation (RAG) models, which integrate large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in various natural language processing (NLP) tasks. However, applying RAG models to Persian, a low-resource language, poses distinct challenges. These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation components of the system. In this paper, we address the challenges of implementing a real-world RAG system for the Persian language, called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets. Our experimental results demonstrate the capability of the PersianRAG framework to enhance the question answering task in Persian. © 2024 IEEE.
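The retrieve-then-prompt skeleton of a RAG system can be sketched in a few lines. This is a generic illustration, not PersianRAG itself: the bag-of-words "embedding" stands in for learned embeddings, and the prompt template is an assumption.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real RAG system uses learned vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Rank documents by similarity to the query; return the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, contexts):
    """Prompt construction: retrieved passages prepended to the question."""
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\nContext:\n{ctx}\nQuestion: {query}"

docs = ["Isfahan is famous for its historical bridges.",
        "RAG combines retrieval with text generation."]
top = retrieve("what is RAG retrieval generation", docs)
prompt = build_prompt("What is RAG?", top)
```

In a full pipeline, `prompt` would be passed to the generative language model; the preprocessing, embedding, and evaluation stages the paper discusses each replace one of these toy components.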
The tourism industry has undergone a significant shift towards data-driven strategies in recent years. As a means of improving the quality of their service and performance, service providers are analyzing feedback from their customers to increase the number of tourists they attract. Negative feedback also provides valuable insights into the factors that detract from a location's appeal. Datasets that gather information on people's experiences and opinions of tourist destinations can be analyzed to extract valuable information. However, few existing datasets specifically capture user reviews about historical and tourist attractions in Iran. Users have shared their travel experiences on various websites, and sentiment analysis can be employed to extract insights from this data; effective sentiment analysis, in turn, requires a suitable approach for data extraction, pre-processing, and storage. To fill this gap, this study provides a framework for user review dataset preparation, including data collection, ETL, data storage, and evaluation phases. A rich dataset containing user reviews about 178 of Iran's historical and tourist attractions was prepared through the proposed framework, in which automated crawlers were developed to collect data from the Tripadvisor platform. Data labelling was achieved using the DistilBERT-base-uncased language model for sentiment analysis and human evaluators for final annotations. A total of approximately 25 thousand samples were included in the dataset, and positive user comments outnumbered negative user comments by a wide margin. This high percentage of positive comments suggests that the locations were of a satisfactory standard, making it likely that users would return in the future. The findings of this study can help providers improve the overall quality of their services by analyzing user reviews.
The proposed framework and achieved dataset can also guide future efforts to leverage data for improved performance and customer satisfaction in the tourism industry by identifying areas that need improvement. © 2023 IEEE.
Kazemi, A.,
Zojaji, Z.,
Malverdi, M.,
Mozafari, J.,
Ebrahimi, F.,
Abadani, N.,
Varasteh, M.R.,
Nematbakhsh, M.A. Information Retrieval Journal (13864564)26(1)
Nowadays, a considerable volume of news articles is produced daily by news agencies worldwide. Since there is an extensive volume of news on the web, finding exact answers to users' questions is not a straightforward task. Developing Question Answering (QA) systems for news articles can tackle this challenge. Due to the lack of studies on Persian QA systems and the importance and wide applications of QA systems in the news domain, this research aims to design and implement a QA system for Persian news articles. To the best of our knowledge, this is the first attempt to develop a Persian QA system in the news domain. We first create FarsQuAD: a Persian QA dataset for the news domain. We analyze the type and complexity of users' questions about Persian news. The results show that What and Who questions have the most and Why and Which questions have the fewest occurrences in the Persian news domain. The results also indicate that users usually raise complex questions about Persian news. We then develop FarsNewsQA: a QA system for answering questions about Persian news, with three models built on BERT, ParsBERT, and ALBERT. The best version of FarsNewsQA offers an F1 score of 75.61%, which is comparable to that of QA systems on Stanford University's English SQuAD dataset, and shows that BERT-based technologies work well for Persian news QA systems. © 2023, The Author(s), under exclusive licence to Springer Nature B.V.
International Journal of Engineering, Transactions B: Applications (1728144X)36(2)pp. 335-347
Gender is an important aspect of a person's identity. In many applications, gender identification is useful for personalizing services and recommendations. At the same time, many people today spend a lot of time on their mobile phones. Studies have shown that the way users interact with mobile phones is influenced by their gender, but the existing methods for identifying the gender of mobile phone users are either not accurate enough or require specific sensors and user activities. In this paper, for the first time, internet usage patterns are used to identify the gender of mobile phone users. To this end, the interaction data, and especially the internet usage patterns, of a random sample of people were automatically recorded by an application installed on their mobile phones. Gender identification was then modeled using different machine learning classification methods. The evaluations showed that the internet features play an important role in recognizing users' gender. The linear support vector machine was the superior classifier, with an accuracy of 85% and an F-measure of 85%. © 2023 Materials and Energy Research Center. All rights reserved.
Social Network Analysis and Mining (18695450)12(1)
As online social networks experience extreme popularity growth, automatically determining the veracity of online statements, denoted as rumors, as early as possible is essential to prevent the harmful effects of propagating misinformation. Early detection of rumors is facilitated by considering the wisdom of the crowd through analyzing the different attitudes expressed toward a rumor (i.e., users' stances). Stance detection is an imbalanced problem, as the querying and denying stances against a given rumor are significantly less frequent than supportive and commenting stances. However, the success of stance-based rumor detection significantly depends on the efficient detection of the "query" and "deny" classes. The imbalance problem has caused previous stance classifier models to be biased toward the majority classes and ignore the minority ones. Consequently, stance classifiers, and subsequently rumor classifiers, have suffered from low performance. This paper proposes a novel adaptive cost-sensitive loss function for learning imbalanced stance data using deep neural networks, which improves the performance of stance classifiers on rare classes. The proposed loss function is a cost-sensitive form of cross-entropy loss. In contrast to most existing cost-sensitive deep neural network models, the utilized cost matrix is not set manually but is adaptively tuned during the learning process. Hence, the contributions of the proposed method lie both in the formulation of the loss function and in the algorithm for calculating adaptive costs. The experimental results of applying the proposed algorithm to stance classification of real Twitter and Reddit data demonstrate its capability in detecting rare classes while improving the overall performance. The proposed method improves the mean F-score of the rare classes by about 13% on the RumorEval 2017 dataset and about 20% on the RumorEval 2019 dataset.
© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature.
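The shape of a cost-sensitive cross-entropy with adaptive class costs can be sketched as follows. The abstract does not give the paper's exact cost-update formula, so the recall-based rule below (raise the cost of poorly recalled classes) is an illustrative assumption, as are the four stance classes used as an example.

```python
import math

def weighted_cross_entropy(probs, label, weights):
    """Cost-sensitive cross-entropy: the log-loss of the true class is
    scaled by that class's cost weight."""
    return -weights[label] * math.log(probs[label])

def update_weights(recall_per_class, eps=1e-3):
    """One plausible adaptive rule (not the paper's exact formula):
    classes the model currently recalls poorly get a higher cost.
    Weights are normalized so their mean is 1."""
    inv = [1.0 / (r + eps) for r in recall_per_class]
    s = sum(inv)
    return [len(inv) * v / s for v in inv]

# Four stance classes: support, deny, query, comment.
recall = [0.90, 0.20, 0.25, 0.85]   # minority classes recalled poorly
w = update_weights(recall)           # "deny" and "query" get larger costs
probs = [0.10, 0.60, 0.10, 0.20]     # model output for one example
loss_deny = weighted_cross_entropy(probs, 1, w)  # true class = "deny"
```

During training, `update_weights` would be re-run each epoch from validation recall, so the cost matrix tracks which classes the network is currently neglecting, rather than being fixed in advance.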
Applied Soft Computing (15684946)122
Despite the empirical success of genetic programming (GP) in various symbolic regression applications, GP is still not known as a reliable problem-solving technique in this domain. The non-locality of GP representation and operators makes its search procedure ineffective. This study employs semantic schema theory to control and guide the GP search and proposes a local GP called semantic schema-based genetic programming (SBGP). SBGP partitions the semantic search space into semantic schemas and biases the search toward the significant schema of the population, which gradually progresses towards the optimal solution. Several semantic local operators are proposed for performing a local search around the significant schema. In combination with schema evolution as a global search, the local in-schema search provides an efficient exploration–exploitation control mechanism in SBGP. To evaluate the proposed method, we use six benchmarks, including synthesized and real-world problems. The obtained errors are compared to the best semantic genetic programming algorithms on the one hand, and to data-driven layered learning approaches on the other. Results demonstrate that SBGP outperforms all mentioned methods in four out of six benchmarks, by up to 87% in the first set and up to 76% in the second set of experiments, in terms of generalization measured by root mean squared error. © 2022 Elsevier B.V.
Road Materials and Pavement Design (14680629)21(3)pp. 850-866
Fatigue cracking is the most important structural failure in flexible pavements. This paper investigates the results of a laboratory study evaluating the fatigue properties of mixtures containing precipitated calcium carbonate (PCC) using the indirect tensile fatigue (ITF) test. The hot mix asphalt (HMA) samples were made with four PCC contents (0%, 5%, 10%, and 15%) and tested at three testing temperatures (2°C, 10°C and 20°C) and stress levels (100, 300, and 500 kPa). Due to the complex behaviour of asphalt pavement materials under various loading conditions, pavement structures, and environmental conditions, accurately predicting the fatigue life of asphalt pavement is difficult. In this study, genetic programming (GP) is utilised to predict the fatigue life of HMA. Based on the results of the ITF test, PCC improved the fatigue behaviour of the studied mixes at different temperatures, although the considerable negative effect of increasing temperature on the fatigue life of HMA is evident. The results also indicate that the GP-based formulas are simple, straightforward, and particularly valuable for providing an analysis tool accessible to practicing engineers. © 2018, © 2018 Informa UK Limited, trading as Taylor & Francis Group.
Soft Computing (14327643)22(10)pp. 3237-3260
A considerable research effort has recently been devoted to improving the power of genetic programming (GP) by accommodating semantic awareness. The semantics of a tree implies its behavior during execution. A reliable theoretical model of GP should be aware of the behavior of individuals. Schema theory is a theoretical tool used to model the distribution of the population over a set of similar points in the search space, referred to as a schema. There are several major issues with prior schema theories, which define schemata at the syntactic level. Incorporating semantic awareness into schema theory has been scarcely studied in the literature. In this paper, we present an improved approach for developing the semantic schema in GP. The semantics of a tree is interpreted as the normalized mutual information between its output vector and the target. A new model of the semantic search space is introduced according to this definition of semantics, and the semantic building block space is presented as an intermediate space between the semantic and genotype spaces. An improved approach is provided for representing trees in the building block space. The presented schema is characterized by a Poisson distribution of trees in this space. The corresponding schema theory is developed for predicting the expected number of individuals belonging to the proposed schema in the next generation. The suggested schema theory provides new insight into the relation between syntactic and semantic spaces. It has been shown to be efficient in comparison with the existing semantic schema, in both generalization and diversity-preserving aspects. Experimental results also indicate that the proposed schema is much less computationally expensive than similar work. © 2017, Springer-Verlag GmbH Germany.
Applied Intelligence (0924669X)48(6)pp. 1442-1460
Semantic schema theory is a theoretical model used to describe the behavior of evolutionary algorithms. It partitions the search space into schemata, defined at the semantic level, and studies their distribution during evolution. Semantic schema theory has definite advantages over popular syntactic schema theories, whose reliability and usefulness have been criticized. Integrating semantic awareness into genetic programming (GP) in recent years also sheds new light on schema theory investigations. This paper extends recent work on the semantic schema theory of GP by utilizing information-based clustering. To this end, we first define the notion of semantics for a tree based on the mutual information between its output vector and the target, and introduce semantic building blocks to facilitate the modeling of the semantic schema. Then, we propose information-based clustering to cluster the building blocks. Trees are then represented in terms of the active occurrence of building block clusters, and schema instances are characterized by an instantiation function over this representation. Finally, the expected number of schema samples is predicted by the suggested theory. To evaluate the suggested schema, several experiments were conducted in which the generalization, diversity-preserving capability, and efficiency of the schema were investigated. The results are encouraging and remarkably promising compared with the existing semantic schema. © 2017, Springer Science+Business Media, LLC.
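The mutual-information notion of semantics used in the two papers above can be made concrete for discrete vectors. The sketch below normalizes by the geometric mean of the two entropies, which is one common choice; the papers' exact normalization and the discretization of continuous GP outputs are not specified in the abstracts and are assumptions here.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in nats) of a discrete vector."""
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

def normalized_mutual_information(output, target):
    """NMI between two discrete vectors: I(X;Y) / sqrt(H(X) * H(Y)).
    Continuous GP outputs would need discretization first."""
    n = len(output)
    cx, cy = Counter(output), Counter(target)
    joint = Counter(zip(output, target))
    mi = sum(c / n * math.log((c / n) / ((cx[x] / n) * (cy[y] / n)))
             for (x, y), c in joint.items())
    denom = math.sqrt(entropy(output) * entropy(target))
    return mi / denom if denom else 0.0

# A tree whose output matches the target exactly has NMI 1; a partially
# informative output scores strictly between 0 and 1.
target = [0, 0, 1, 1, 0, 1, 0, 1]
print(normalized_mutual_information(target, target))
print(normalized_mutual_information([0, 1] * 4, target))
```

Under this view, two trees with very different syntax but identical NMI profiles against the target belong to the same semantic neighborhood, which is exactly what the semantic schema groups together.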
Applied Intelligence (0924669X)45(4)pp. 1066-1088
Automated program repair is still a highly challenging problem, mainly due to the reliance of current techniques on test cases to validate candidate patches. This leads to increasing unreliability of the final patches, since test cases are partial specifications of the software. In the present paper, an automated program repair method is proposed by integrating genetic programming (GP) and model checking (MC). Due to its capability to verify finite-state systems, MC is employed as an appropriate criterion for evolving programs to calculate the fitness in GP. The application of MC for fitness evaluation, which is novel in the context of program repair, addresses an important gap in current heuristic approaches to program repair. By focusing fault detection on the desired aspects, it enables programmers to detect faults according to the definition of properties. This yields a general method that can be effectively customized for different application domains and their corresponding faults. Apart from various types of faults, the proposed method is capable of handling concurrency bugs, which is not the case in many general repair methods. To evaluate the proposed method, it was implemented as a tool, named JBF, to repair Java programs. To meet the objectives of the study, experiments were conducted in which certain programs with known bugs were automatically repaired by the JBF tool. The obtained results are encouraging and remarkably promising. © 2016, Springer Science+Business Media New York.
Soft Computing (14327643)20(5)pp. 2031-2045
Determining a suitable mesh density for complicated finite element analyses, e.g., of the laser forming process, has always been a main concern of analysis engineers because of the high computation time and cost. Few works have addressed the application of optimization methods to finite element analysis of linear-path laser scans, and no study has yet considered optimum finite element analysis of circular-path laser forming. The main objective of this article is to develop a method for determining the optimum mesh density to estimate the deflection caused by a circular-path laser beam scan, considering analysis time and forming accuracy. Optimum ranges of mesh densities are investigated first, and then a deflection estimation process based on an adaptive-network-based fuzzy inference system is introduced. The proposed model is then optimized using a genetic algorithm, considering accuracy and time. The numerical analysis results were confirmed by the conducted experiments. © 2015, Springer-Verlag Berlin Heidelberg.
Applied Intelligence (0924669X)44(1)pp. 67-87
Schema theory is the most well-known model of evolutionary algorithms. Imitating genetic algorithms (GA), nearly all schemata defined for genetic programming (GP) refer to a set of points in the search space that share some syntactic characteristics. In GP, however, syntactically similar individuals do not necessarily have similar semantics: the instances of a syntactic schema do not behave similarly, hence the corresponding schema theory becomes unreliable. Therefore, these theories have rarely been used to improve the performance of GP. The main objective of this study is to propose a schema theory that is a more realistic model of GP and can potentially be employed to improve GP in practice. To achieve this aim, the concept of a semantic schema is introduced. This schema partitions the search space according to the semantics of trees, regardless of their syntactic variety. We interpret the semantics of a tree in terms of the mutual information between its output and the target. The semantic schema is characterized by a set of semantic building blocks and their joint probability distribution. After introducing the semantic building blocks, an algorithm for finding them in a given population is presented. An extraction method that looks for the most significant schema of the population is provided. Moreover, an exact microscopic schema theorem is suggested that predicts the expected number of schema samples in the next generation. Experimental results demonstrate the capability of the proposed schema definition in representing the semantics of the schema instances. They also reveal that the semantic schema theorem estimation is more realistic than previously defined schemata. © 2015, Springer Science+Business Media New York.