Articles
2025 29th International Computer Conference, Computer Society of Iran, CSICC 2025, pp. 146-150
This paper compares two prevalent systolic array architectures: weight stationary and output stationary. Systolic arrays use interconnected processing elements (PEs) to perform parallel processing, which makes them suitable for applications in digital signal processing, image processing, and machine learning. We focus on their implementation of 2D matrix multiplication, a fundamental operation in neural networks. Simulations were conducted in Verilog HDL within the Xilinx Vivado Design Suite 2019, using a 3x1 input matrix and a 3x3 weight matrix. The results confirmed the functionality of both architectures, with output matrices matching the expected results. Weight stationary designs minimized data movement, while output stationary designs enhanced throughput through effective input data reuse. Furthermore, the critical path delay stays at approximately 8.8 ns, corresponding to a maximum frequency of about 113 MHz, as the number of PEs increases. Overall, this research validates the effectiveness of both architectures in high-performance matrix operations and offers insights for future systolic array designs. © 2024 IEEE.
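The two dataflows can be illustrated with a small behavioural model in Python. This is only a sketch under stated assumptions, not the authors' Verilog design: the function names are hypothetical, the mapping of PEs to matrix entries is one common convention, and the cycle-level input skew of a real systolic array is not modelled. The last helper converts a critical path delay into a maximum clock frequency.

```python
def weight_stationary(W, x):
    """Assumed mapping: PE(i, j) permanently holds W[i][j]; partial sums move along row i."""
    y = []
    for i in range(len(W)):
        psum = 0  # partial sum entering the first PE of row i
        for j in range(len(W[0])):
            psum += W[i][j] * x[j]  # PE(i, j) adds its product and forwards the sum
        y.append(psum)
    return y


def output_stationary(W, x):
    """Assumed mapping: PE i keeps the accumulator for y[i]; weights and inputs stream in."""
    acc = [0] * len(W)
    for step in range(len(W[0])):      # one streamed input element per time step
        for i in range(len(W)):        # these PEs operate in parallel in hardware
            acc[i] += W[i][step] * x[step]
    return acc


def max_frequency_mhz(critical_path_ns):
    """f_max = 1 / t_critical, expressed in MHz for a delay given in ns."""
    return 1000.0 / critical_path_ns


# Hypothetical test data: a 3x3 weight matrix and a 3x1 input vector.
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [1, 0, -1]
print(weight_stationary(W, x), output_stationary(W, x))
print(round(max_frequency_mhz(8.8), 1))
```

Both functions compute the same product and differ only in which operand stays inside a PE, which is the distinction the abstract draws between reduced data movement and input reuse.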
Journal of Supercomputing (15730484), 79(16), pp. 18910-18946
Convolution has been widely used as a main component of digital image processing applications. In convolutional computations, the large number of memory accesses and the huge amount of computation limit performance. Many of the proposed convolvers are based on exact computations. Although exact convolvers keep the accuracy of the convolution operation at the top level, performance can sometimes be improved by giving up a negligible amount of accuracy. Approximate computing is a new technique for addressing computation overhead. In this paper, approximate 2D convolvers are presented which minimize the memory access rate and computations by a special factor of multiply-and-accumulate (MAC) terms. In addition, to preserve the flexibility to support different required accuracy levels, the proposed approximate convolvers are combined with the exact designs through real-time pre-processing stages, using innovative methods that manage the hardware overhead. In comparison with conventional convolvers, the proposed designs reduce the number of active resources, which significantly reduces power consumption. For a 3 × 3 kernel size, the evaluation results on the Xilinx Virtex-7 (XC7V2000t) FPGA device show 34% and 20% power savings for the proposed approximate and combined convolvers, respectively, in comparison with the exact convolver (EC). This improvement grows as the kernel size increases. Finally, a comparison based on RMSE and PSNR for different sample images and filters reveals that the error rate and the reduction in image quality are acceptable for many real-time image processing applications. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
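The RMSE and PSNR metrics mentioned above can be sketched with their standard definitions. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, images are assumed to be equally sized lists of rows, and the peak value of 255 assumes 8-bit pixels.

```python
import math


def rmse(reference, approximate):
    """Root-mean-square error between two equally sized 2D images (lists of rows)."""
    squared = [
        (a - b) ** 2
        for ref_row, apx_row in zip(reference, approximate)
        for a, b in zip(ref_row, apx_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, approximate, peak=255):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit pixels."""
    error = rmse(reference, approximate)
    if error == 0:
        return math.inf  # identical images
    return 20 * math.log10(peak / error)


# Hypothetical 2x2 example: exact output versus an approximate output.
exact = [[10, 20], [30, 40]]
approx = [[12, 20], [30, 36]]
print(rmse(exact, approx), psnr(exact, approx))
```

A lower RMSE and a higher PSNR both indicate that the approximate convolver's output is closer to the exact one.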
Journal of Supercomputing (15730484), 78(2), pp. 2597-2615
Two-dimensional convolution plays a fundamental role in different image processing applications. Convolving images with different kernel sizes enriches the overall performance of these applications. It is therefore necessary to design a reconfigurable convolver for a desired list of kernel sizes. In this paper, a novel approach is presented for implementing an area-efficient reconfigurable convolver with appropriate throughput and convolution computational time for an arbitrary kernel size list. This approach is based on adjusting the arrangement of logical blocks in conventional convolvers. The feasibility and benefits of the proposed approach are demonstrated through a case study of the design implementation on an FPGA platform using the Xilinx ISE software. Compared to well-known reconfigurable convolvers, the proposed design significantly reduces convolution computational time and improves throughput with a reasonable number of hardware resources. For instance, the proposed reconfigurable convolver requires only 0.38 ms to perform a 3 × 3 convolution on a 268 × 460 image with 8-bit pixels and occupies only 455 slices of a Xilinx Virtex-4 (XC4VLX25) FPGA, providing a throughput of 324 million outputs per second (MOPS) at an 81 MHz clock frequency for a kernel size of 3 × 3. On average, the MOPS of the proposed approach is improved by approximately 43.13% relative to the other considered alternatives. Experimental results confirm that the proposed reconfigurable convolver is a very competitive design among the alternative reconfigurable convolvers. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
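The reported figures are mutually consistent, which a short calculation makes visible. The sketch below is an assumption-laden check, not the authors' analysis: the function names are hypothetical, one output is assumed per image pixel, and pipeline fill latency and border handling are ignored.

```python
def outputs_per_cycle(throughput_mops, clock_mhz):
    """Average number of convolution outputs produced per clock cycle."""
    return throughput_mops / clock_mhz


def frame_time_ms(width, height, throughput_mops):
    """Time to produce one output per pixel, ignoring pipeline fill and borders."""
    total_outputs = width * height
    seconds = total_outputs / (throughput_mops * 1e6)
    return seconds * 1e3


print(outputs_per_cycle(324, 81))             # outputs per clock cycle
print(round(frame_time_ms(268, 460, 324), 2)) # milliseconds per 268 x 460 image
```

Under these assumptions, 324 MOPS at 81 MHz corresponds to four outputs per clock cycle, and 268 × 460 = 123,280 outputs at that rate take roughly 0.38 ms, matching the stated computation time.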
Iranian Conference on Machine Vision and Image Processing, MVIP (21666776), 2020
Two-dimensional (2-D) convolution is a common operation in a wide range of signal and image processing applications such as edge detection, sharpening, and blurring. In the hardware implementation of these applications, 2-D convolution is one of the most challenging parts because it is a compute-intensive and memory-intensive operation. To address these challenges, several design techniques such as pipelining, constant multiplication, and time-sharing have been applied in the literature, leading to convolvers with different implementation features. In this paper, based on these design techniques, we classify the convolvers into four classes: Non-Pipelined Convolver, Reduced-Bandwidth Pipelined Convolver, Multiplier-Less Pipelined Convolver, and Time-Shared Convolver. Then, the implementation features of these classes, such as critical path delay, memory bandwidth, and resource utilization, are discussed analytically for different convolution kernel sizes. Finally, an instance of each class is captured in Verilog, and their features are evaluated by implementing them on a Virtex-7 FPGA; the reported results confirm the analytical discussion. © 2020 IEEE.
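As a software reference for the operation these convolver classes implement, a direct 2-D convolution can be sketched as follows. This is a plain behavioural model, not any of the four hardware classes: the function name is hypothetical, only the "valid" output region is computed, and the kernel is flipped as in the mathematical definition of convolution.

```python
def convolve2d_valid(image, kernel):
    """Direct 2-D convolution over the region where the kernel fits entirely."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    output = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            # kh * kw multiply-and-accumulate steps and pixel reads per output
            for i in range(kh):
                for j in range(kw):
                    # Convolution flips the kernel in both axes
                    acc += image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 3x3 image and 2x2 kernel.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(convolve2d_valid(image, kernel))
```

The inner loops show why the operation is both compute-intensive and memory-intensive: each output needs one multiply-and-accumulate and one pixel read per kernel element unless the hardware reuses pixels across neighbouring windows.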