2025 29th International Computer Conference, Computer Society of Iran (CSICC 2025), pp. 146-150
This paper compares two prevalent systolic array architectures: the weight-stationary and output-stationary methods. Systolic arrays use interconnected processing elements (PEs) to perform parallel processing, making them suitable for applications in digital signal processing, image processing, and machine learning. We focus on their implementation of 2D matrix multiplication, a fundamental operation in neural networks. Simulations were conducted in Verilog HDL within the Xilinx Vivado Design Suite 2019, using a 3×1 input matrix and a 3×3 weight matrix. The results confirmed the functionality of both architectures, with output matrices matching the expected values. Weight-stationary designs minimized data movement, while output-stationary designs enhanced throughput through effective input data reuse. The study also shows that the critical path delay, approximately 8.8 ns (a maximum frequency of about 113 MHz), remains constant as the number of PEs scales, providing useful guidance for future architectural designs. Overall, this research validates the effectiveness of both architectures for high-performance matrix operations. © 2024 IEEE.
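The paper's designs are in Verilog HDL; as a behavioral illustration only, the two dataflows for the matrix-vector product y = W·x can be sketched in Python. The loop structure (weights held in PEs vs. accumulators held in PEs) is the point; this is not the paper's implementation.

```python
# Behavioral sketch of the two systolic dataflows for y = W @ x
# (3x3 weight matrix, 3x1 input), not the paper's Verilog design.

W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
x = [1, 0, 2]

def weight_stationary(W, x):
    """Weight stationary: each PE permanently holds one weight; inputs
    stream past, and the partial sum ripples along each row of PEs."""
    y = []
    for row in W:                  # one PE row per output element
        psum = 0
        for w, xk in zip(row, x):  # PE k: held weight w, streamed input xk
            psum += w * xk         # MAC, then forward psum to the next PE
        y.append(psum)
    return y

def output_stationary(W, x):
    """Output stationary: each PE owns one output accumulator; weights
    and inputs both stream through, so each input is reused by all PEs."""
    acc = [0] * len(W)             # one accumulator per PE
    for k, xk in enumerate(x):     # cycle k: stream x[k] to every PE
        for i in range(len(W)):
            acc[i] += W[i][k] * xk
    return acc

# Both dataflows produce the same product.
assert weight_stationary(W, x) == output_stationary(W, x) == [7, 16, 25]
```

The two functions compute identical results; they differ in which operand stays resident in a PE, which is exactly the data-movement trade-off the paper measures.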
Journal of Supercomputing (15730484), 79(16), pp. 18910-18946
Convolution is widely used as the core of many improvements in digital image processing applications. In convolutional computations, the large number of memory accesses and the huge amount of computation challenge performance. Many of the proposed convolvers are based on exact computation. Although exact convolvers keep the accuracy of the convolution operation at the top level, sacrificing a negligible amount of accuracy can sometimes improve performance. Approximate computing is a recent technique for addressing such computation overhead. In this paper, approximate 2D convolvers are presented that minimize the memory access rate and the amount of computation by a special factoring of the multiply-and-accumulate (MAC) terms. To preserve the flexibility to support different accuracy requirements, the proposed approximate convolvers are combined with exact designs through real-time pre-processing stages, using innovative methods that manage the hardware overhead. Compared with conventional convolvers, the proposed designs reduce the number of active resources, which yields a significant reduction in power consumption. For a 3 × 3 kernel, evaluation results on the Xilinx Virtex-7 (XC7V2000t) FPGA device show 34% and 20% power optimization for the proposed approximate and combined convolvers, respectively, compared with the exact convolver (EC). This improvement grows with the kernel size. Finally, a comparison based on RMSE and PSNR for different sample images and filters reveals that the error rate and image quality reduction are acceptable for many real-time image processing applications. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
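The abstract does not spell out how MAC terms are reduced; one common way to approximate a convolver is to evaluate only a subset of the kernel taps. The sketch below assumes a largest-magnitude tap-selection rule as an illustration, it is not the paper's actual factoring.

```python
def conv2d(img, kernel, taps=None):
    """Naive 'valid' 2-D convolution over lists of lists. `taps` optionally
    restricts which kernel positions contribute (the approximate mode)."""
    kh, kw = len(kernel), len(kernel[0])
    if taps is None:  # exact mode: every MAC term participates
        taps = [(i, j) for i in range(kh) for j in range(kw)]
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            acc = 0
            for i, j in taps:  # only the selected MAC terms are computed
                acc += kernel[i][j] * img[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out

def top_taps(kernel, keep):
    """Keep the `keep` largest-magnitude kernel taps. This selection rule
    is an assumption for illustration, not the paper's method."""
    flat = [(abs(kernel[i][j]), i, j)
            for i in range(len(kernel)) for j in range(len(kernel[0]))]
    flat.sort(reverse=True)
    return [(i, j) for _, i, j in flat[:keep]]
```

Dropping low-weight taps removes both the multiplications and the associated pixel fetches, which is the kind of computation/memory-access saving the abstract describes.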
Journal of Supercomputing (15730484), 78(2), pp. 2597-2615
Two-dimensional convolution plays a fundamental role in many image processing applications. Convolving an image with different kernel sizes enriches the overall performance of such applications, so it is necessary to design a reconfigurable convolver for the desired list of kernel sizes. In this paper, a novel approach is presented for implementing an area-efficient reconfigurable convolver with appropriate throughput and convolution computation time for an arbitrary kernel size list. The approach is based on adjusting the arrangement of logical blocks in conventional convolvers. Its feasibility and benefits are demonstrated through a case study implemented on an FPGA platform using the Xilinx ISE software. Compared to well-known reconfigurable convolvers, the proposed design significantly reduces convolution computation time and improves throughput with a reasonable number of hardware resources. For instance, the proposed reconfigurable convolver requires only 0.38 ms to perform a 3 × 3 convolution on a 268 × 460 image with 8-bit pixels and occupies only 455 slices of a Xilinx Virtex-4 (XC4VLX25) FPGA, providing a throughput of 324 million outputs per second (MOPS) at an 81 MHz clock frequency for a 3 × 3 kernel. On average, the MOPS of the proposed approach is improved by approximately 43.13% relative to the other considered alternatives. Experimental results confirm that the proposed reconfigurable convolver is very competitive among the alternative reconfigurable convolvers. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Iranian Conference on Machine Vision and Image Processing, MVIP (21666776), 2020
Two-dimensional (2-D) convolution is a common operation in a wide range of signal and image processing applications such as edge detection, sharpening, and blurring. In hardware implementations of these applications, 2-D convolution is one of the most challenging parts because it is both compute-intensive and memory-intensive. To address these challenges, several design techniques such as pipelining, constant multiplication, and time-sharing have been applied in the literature, leading to convolvers with different implementation features. In this paper, based on these design techniques, we classify the convolvers into four classes: Non-Pipelined Convolver, Reduced-Bandwidth Pipelined Convolver, Multiplier-Less Pipelined Convolver, and Time-Shared Convolver. The implementation features of these classes, such as critical path delay, memory bandwidth, and resource utilization, are then analytically discussed for different convolution kernel sizes. Finally, an instance of each class is captured in Verilog and evaluated on a Virtex-7 FPGA; the reported results confirm the analytical discussion. © 2020 IEEE.
Within-die process variations in chip multiprocessors cause wide variation in the distribution of frequency and power consumption between otherwise identical cores. Under such conditions, accounting for these variations in task allocation and power management algorithms is crucial. Among the methods proposed to improve performance and reduce power consumption in variation-affected chip multiprocessors, two are most popular: variation-aware application scheduling and per-core dynamic voltage and frequency scaling (DVFS). However, none of the proposed methods guarantees load balancing in chip multiprocessors, so an appropriate distribution of power during task allocation cannot be achieved by these methods. In this paper, we propose an algorithm that intelligently maps (and remaps) applications onto cores while considering their variations. Moreover, the proposed power management algorithm maximizes overall performance under a given power budget. For power management, low-resolution DVFS is used to decrease the algorithm's overhead. Additionally, the algorithm is 'balanced' to avoid hotspots on the chip. The proposed method is tested on a special-purpose simulator based on the E3S data set. Experimental results show that the algorithm achieves significant improvement in power distribution and fair application scheduling compared to the other methods. © 2020 IEEE.
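The paper's exact mapping algorithm is not given in the abstract; the core idea, assigning work to cores whose effective frequency varies, can be sketched with a simple greedy earliest-finish heuristic. The core frequencies, task loads, and the greedy rule below are illustrative assumptions, not the paper's method.

```python
def map_tasks(task_loads, core_freqs):
    """Greedy variation-aware mapping sketch: place each task (heaviest
    first) on the core where it would finish earliest, which naturally
    balances load across cores with different post-variation frequencies.

    task_loads: work per task (arbitrary units); core_freqs: effective
    frequency of each core after process variation (units/time)."""
    finish = [0.0] * len(core_freqs)   # current finish time of each core
    assignment = []                     # chosen core, in heaviest-first order
    for load in sorted(task_loads, reverse=True):
        best = min(range(len(core_freqs)),
                   key=lambda c: finish[c] + load / core_freqs[c])
        finish[best] += load / core_freqs[best]
        assignment.append(best)
    return assignment, finish
```

With two cores where variation left core 0 twice as fast (2.0 vs. 1.0), tasks of load 4, 3, 2, 1 spread so that both cores finish at nearly the same time, rather than piling onto the nominally "best" core.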
IEEE Access (21693536), 8, pp. 39934-39945
Edge detection is a common operation in image/video processing applications. Canny edge detection, which performs well under different conditions, is one of the most popular and widely used of these algorithms. Canny's superior performance is due mainly to its ability to adjust the output quality by manipulating the edge detection parameters, Sigma and Threshold. Calculating values for these two parameters on the fly, based on the application's circumstances, requires additional preprocessing, which increases the algorithm's computational complexity. To reduce this complexity, several proposed methods simply employ precalculated, fixed values for the Canny parameters (based on either worst-case or typical conditions), which sacrifices edge detection performance in favor of lower computational complexity. In this paper, an adaptive parameter selection method is proposed that selects values for the Canny parameters from a configuration table (rather than calculating them at run time), based on the estimated noise intensity of the input image and the minimum output performance that satisfies the application requirements. This adaptive implementation of the Canny algorithm ensures that, while its edge detection performance (noise robustness) exceeds that of state-of-the-art counterparts in different circumstances, its execution time remains lower than those of recent cutting-edge Canny realizations. © 2013 IEEE.
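The table-driven selection idea can be sketched as a lookup keyed on estimated noise intensity. The bucket boundaries and (Sigma, Threshold) entries below are invented placeholders, not the paper's calibrated configuration table, and the noise estimate is assumed to come from elsewhere.

```python
# Illustrative configuration table: (max_noise_sigma, canny_sigma, low, high).
# These numbers are placeholders, not the paper's calibrated values.
CONFIG = [
    (2.0, 1.0, 40, 80),    # low-noise images: small blur, tight thresholds
    (5.0, 1.4, 50, 100),   # moderate noise
    (10.0, 2.0, 60, 120),  # heavy noise: stronger smoothing
]

def select_params(noise_sigma):
    """Pick (sigma, low_threshold, high_threshold) from the first table row
    whose noise bucket covers the estimated noise intensity; fall back to
    the noisiest bucket. A table lookup replaces run-time calculation."""
    for max_noise, sigma, low, high in CONFIG:
        if noise_sigma <= max_noise:
            return sigma, low, high
    return CONFIG[-1][1:]  # noisier than every bucket: use the last row
```

The lookup is O(table size) per frame, which is the complexity saving the abstract describes versus recomputing the parameters for every input image.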
IEEE Transactions on Circuits and Systems II: Express Briefs (15583791), 66(1), pp. 146-150
2-D convolution has been used as a subsystem for filtering and enhancement in a wide range of signal and image processing applications. It is not only a memory-intensive operation but also a compute-intensive one. Previous efforts to improve the performance of 2-D convolution have primarily focused on the memory access challenges. However, a convolver's performance is highly dependent on the efficient design of its computation units, which can be enhanced significantly by techniques such as pipelining. In this brief, a pipelined architecture with a low pixel access rate for implementing 2-D convolution is presented. Compared to conventional convolvers, the proposed design has a fixed size (independent of the problem/kernel) and a significantly shorter critical path, especially for large kernel sizes; the proposed convolver runs at a 283-MHz clock frequency on a Xilinx Virtex-7 (XC7V2000t) field-programmable gate array for a 3 × 3 kernel. Additionally, the required pixel access rate of the new scheme is lower than that of state-of-the-art methods: only 849 Mb/s for a 3 × 3 kernel and 8-bit pixels. The improvements in critical path delay and required pixel access rate are obtained without a significant increase in resource utilization. © 2004-2012 IEEE.
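The standard way a streaming convolver keeps the pixel access rate low is to buffer the last k rows on chip so each pixel is fetched from external memory once. As an illustration of that general line-buffer idea (the brief's specific architecture is not reproduced here), a behavioral model:

```python
from collections import deque

def conv_rows_streaming(img, kernel):
    """Line-buffer sketch: stream the image row by row, keeping only the
    last k rows in on-chip buffers. Each external-memory pixel is read
    exactly once; a full k x k window is rebuilt from the buffers."""
    k = len(kernel)
    lines = deque(maxlen=k)  # on-chip line buffers (oldest row auto-evicted)
    out = []
    for row in img:          # one external-memory read per pixel, total
        lines.append(row)
        if len(lines) == k:  # k rows buffered: emit one output row
            out_row = []
            for c in range(len(row) - k + 1):
                acc = sum(kernel[i][j] * lines[i][c + j]
                          for i in range(k) for j in range(k))
                out_row.append(acc)
            out.append(out_row)
    return out
```

Without the buffers, each pixel would be re-fetched up to k times as the window slides down the image; the deque models the shift-register line buffers a hardware convolver uses to avoid that.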
Karimiafshar, A., Montazeri, M.A., Kalbasi, M., Fanian, A., pp. 394-399
High-performance heterogeneous processors are being adopted in mobile embedded real-time systems because of increasing computational requirements. A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective way for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. These complex processors have a major drawback when used for real-time purposes: their complexity makes calculating the worst-case execution time (WCET) difficult. The design also faces significant challenges in the operating system (OS), which traditionally assumes homogeneous hardware. The OS scheduler needs to be heterogeneity-aware so it can match jobs to cores according to the characteristics of both. In this paper, we make the case that a scheduler for heterogeneous multicore (HMC) systems should target three objectives: optimal performance, minimum load, and a maximum number of satisfied deadlines. We address this via optimal task-to-core assignment. The proposed scheduler enables performance improvements, load reduction, and an increase in satisfied deadlines for a range of applications. Different scheduling alternatives have been evaluated, and experimental results show that the proposed algorithm provides, on average, improvements in the three objectives ranging from 5.34% to 8.75%. © 2013 IEEE.
Vadiati, M., Ashouri, M., Hajizadeh, E., Khoshnasib, K., Kalbasi, M., Hashemi, M., 2011 (580 CP)
The construction of sub-transmission substations in congested urban areas with dense populations has always been challenging, so reducing the substation footprint is a pressing need that has been considered in many countries. To respond to this need, various techniques have been used so far, such as hybrid solutions, Gas Insulated Switchgear, and innovative Air Insulated Switchgear such as the disconnecting circuit breaker (DCB). Although these new technologies address the issue, conventional substations based on Air Insulated Switchgear are still preferred in many countries. In this paper, a new solution for implementing compact conventional AIS substations (CCAIS) is proposed. The solution is based on optimizing clearances within the standard range, changing the equipment arrangement, adopting an indoor two-floor design, and sharing a common foundation among several pieces of equipment. The CCAIS solution was proposed to the Tehran Regional Electric Company (TREC) in Iran. We then developed an algorithm based on multi-attribute decision making in MATLAB to compare the proposed solution with the other technologies and select the optimal one. The results, which include the ranking of alternatives and a sensitivity analysis on the land cost parameter, are illustrated in this paper. © 2012 The Institution of Engineering and Technology.
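The paper's MATLAB decision algorithm is not shown in the abstract; the general shape of multi-attribute ranking can be sketched with simple additive weighting. The alternative names, attribute scores, and weights below are purely illustrative assumptions, not values from the study.

```python
def rank_alternatives(scores, weights):
    """Simple additive weighting (SAW) sketch: scale each benefit-type
    attribute by its column maximum, multiply by the attribute weight,
    sum per alternative, and rank (higher total is better).

    scores: {alternative_name: [score per attribute]}
    weights: importance weight per attribute (same order as scores)."""
    names = list(scores)
    columns = list(zip(*scores.values()))        # one tuple per attribute
    col_max = [max(col) for col in columns]      # normalization factors
    totals = {name: sum(w * v / m
                        for w, v, m in zip(weights, scores[name], col_max))
              for name in names}
    return sorted(names, key=lambda n: totals[n], reverse=True)

# Hypothetical example: two attributes (e.g., footprint saving, cost merit)
# with made-up scores; only the ranking mechanics are being demonstrated.
ranking = rank_alternatives(
    {"CCAIS": [0.9, 0.8], "GIS": [1.0, 0.3], "Hybrid": [0.7, 0.6]},
    [0.4, 0.6])
```

A sensitivity analysis like the paper's would rerun the ranking while sweeping one weight (e.g., the land-cost attribute) and observe where the top alternative changes.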