Computer Networks (13891286), 196
In the present article, we propose a virtual machine placement (VMP) algorithm for reducing power consumption in heterogeneous cloud data centers. We propose a novel model for estimating the power consumption of the data center network. The proposed model is employed to estimate the power consumption of a Fat-Tree network: it calculates the traffic of each network layer and uses the results to estimate the average power consumption of each switch, from which the network power is computed. Further, we employ the chemical reaction optimization (CRO) algorithm, a meta-heuristic, to obtain a power-efficient mapping of virtual machines (VMs) to physical machines (PMs). Two solution encoding schemes, namely permutation-based and grouping-based encodings, are utilized for representing individuals in CRO. For each encoding scheme, we design the operators required by CRO for manipulating molecules in search of better candidate solutions. Additionally, we model VMs with east–west and north–south communications, and PMs with constrained CPU, memory, and bandwidth capacities. Our network power model is integrated into the CRO algorithms to enable the estimation of both PM and network power consumption. We compared the proposed methods with a number of similar methods. The evaluation results indicate that the proposed methods perform well and that the CRO algorithm with grouping-based encoding outperforms the rest of the methods in terms of power consumption. The results also show the significance of network power consumption. © 2021 Elsevier B.V.
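To make the two encoding schemes concrete, the following sketch shows one common way such encodings are decoded into a VM-to-PM mapping. This is an illustrative assumption, not the paper's implementation: the function names, the single-resource (CPU-only) capacity model, and the first-fit decode rule are all hypothetical.

```python
def decode_permutation(perm, vm_cpu, pm_cap):
    """Permutation-based encoding: the solution is an ordering of VMs.
    Place VMs in that order, first-fit onto PMs with remaining CPU capacity."""
    remaining = list(pm_cap)
    mapping = {}
    for vm in perm:
        for pm, cap in enumerate(remaining):
            if vm_cpu[vm] <= cap:
                mapping[vm] = pm
                remaining[pm] -= vm_cpu[vm]
                break
        else:
            raise ValueError(f"VM {vm} does not fit on any PM")
    return mapping

def decode_grouping(groups):
    """Grouping-based encoding: the solution directly stores, for each
    VM index, the PM (group) it is assigned to."""
    return {vm: pm for vm, pm in enumerate(groups)}
```

A permutation decode must search for a feasible PM per VM, while the grouping decode is a direct lookup, which is one reason the two encodings explore the search space differently.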
Journal of Supercomputing (15730484), 77(5), pp. 5120-5147
Graphics processing units (GPUs) are powerful in performing data-parallel applications. Such applications most often rely on the GPU's memory hierarchy to deliver high performance. Designing an efficient memory hierarchy for GPUs is a challenging task because of the wide architectural space. To moderate this challenge, this paper proposes a framework, called stack distance-analytic modeling (SDAM), to estimate GPU memory performance in terms of memory cycle counts. Providing the model with input data is crucial, both in terms of the accuracy of that data and the time spent obtaining it. SDAM employs the stack distance analysis method and analytical modeling to obtain the required input accurately and swiftly. Further, it employs a detailed analytical model to estimate memory cycles. SDAM is validated against real GPU executions and compared with a cycle-accurate simulator. The experimental evaluations, performed on a set of memory-intensive benchmarks, show that SDAM is faster and more accurate than cycle-accurate simulation, and can thus facilitate GPU cache design-space exploration. For a selection of data-intensive benchmarks, SDAM showed a 32% average error in estimating memory data transfer cycles in a modern GPU, outperforming cycle-accurate simulation while running an order of magnitude faster. Finally, the applicability of SDAM to exploring the GPU cache design space is demonstrated through experiments with various cache designs. © 2020, Springer Science+Business Media, LLC, part of Springer Nature.
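As background for the stack distance analysis that SDAM builds on, the sketch below is the textbook LRU-stack algorithm (not SDAM itself): the stack distance of an access is the number of distinct addresses touched since the previous access to the same address, and infinity for a cold (first-time) access.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access in the trace."""
    stack = []            # distinct addresses, most-recently-used last
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # distinct addresses accessed since the last use of addr
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(float("inf"))   # cold access
        stack.append(addr)               # addr becomes most recent
    return dists
```

This naive version is O(n·m) for n accesses over m distinct addresses; production analyzers use tree-based structures to speed up the stack search.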
Simulation Modelling Practice and Theory (1569190X), 91, pp. 102-122
Memory footprint is a metric for quantifying data reuse in a memory trace. It can also be used to approximate cache performance, especially in shared cache systems. The memory footprint is obtained through memory footprint analysis (FPA). However, the main limitation of FPA is that, for a memory trace of n accesses, the all-window FPA algorithm requires O(n³) time. Therefore, in this paper, we propose an analytical algorithm for FPA, whereby the average footprints are calculated in O(n²). The proposed algorithm can also be employed for window distribution analysis. Moreover, we propose a framework that enables the application of FPA to GPU kernels and models the performance of L1 cache memories. The results of experimental evaluations indicate that the proposed framework runs 1.55X slower than Xiang's formula, a fast average-FPA method, while it can also be utilized for window distribution analysis. In the context of FPA-based cache performance estimation, the experimental results indicate a fair correlation between the estimated L1 miss rates and those of native GPU executions. On average, the proposed framework has a 23.8% error in estimating L1 cache miss rates. Further, our algorithm runs 125X slower than reuse distance analysis (RDA) when analyzing a single kernel. However, the proposed method outperforms RDA in modeling shared caches and multiple-kernel executions in GPUs. © 2018 Elsevier B.V.
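To make the metric concrete, the sketch below implements the naive definition that the fast algorithms above accelerate (it is shown only as an illustration of the quantity being computed, not the paper's O(n²) method): the footprint of a window is the number of distinct addresses it touches, and the average footprint for window length w averages this over all n − w + 1 windows.

```python
def avg_footprint(trace, w):
    """Average number of distinct addresses over all length-w windows."""
    n = len(trace)
    windows = n - w + 1
    total = sum(len(set(trace[i:i + w])) for i in range(windows))
    return total / windows
```

This direct evaluation costs O(n·w) per window length, i.e. O(n³) over all window lengths, which is exactly the cost the analytical O(n²) formulation avoids.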
ACM Transactions on Architecture and Code Optimization (15443973), 15(4)
Reuse distance analysis (RDA) is a popular method for calculating locality profiles and modeling cache performance. The present article proposes a framework for applying the RDA algorithm to obtain reuse distance profiles of graphics processing unit (GPU) kernels. To study the implications of hardware-related parameters in RDA, two RDA algorithms were employed: a high-level, cache-independent RDA algorithm, called HLRDA, and a detailed RDA algorithm, called DRDA. DRDA models the effects of reservation failures in cache blocks and miss status holding registers to provide accurate cache-related performance metrics; in this case, the reuse profiles are cache-specific. In a selection of GPU kernels, DRDA obtained the L1 miss-rate breakdowns with an average error of 3.86% and outperformed the state-of-the-art RDA in terms of accuracy. In terms of performance, DRDA is 246,000× slower than real GPU executions and 11× faster than GPGPU-Sim. HLRDA ignores the cache-related parameters, so its reuse profiles are general and can be used to calculate miss rates for all cache sizes. The average error incurred by HLRDA was 16.9%. © 2018 Association for Computing Machinery.
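A minimal illustration of why a cache-independent reuse profile (as in HLRDA) is reusable across cache sizes, under the standard fully associative LRU approximation (DRDA's detailed model is far more involved): an access misses in a fully associative LRU cache of C blocks exactly when its reuse distance is at least C.

```python
def miss_rate(reuse_dists, cache_blocks):
    """Miss rate of a fully associative LRU cache with `cache_blocks`
    blocks, given one reuse distance per access (inf = cold access)."""
    misses = sum(1 for d in reuse_dists if d >= cache_blocks)
    return misses / len(reuse_dists)
```

Because the same distance list serves every value of `cache_blocks`, one profiling pass suffices to evaluate all cache sizes, which is the practical appeal of general reuse profiles.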
Computing and Informatics (25858807), 38(2), pp. 421-453
In the present paper, we propose RDGC, a reuse distance-based performance analysis approach for the GPU cache hierarchy. RDGC models the thread-level parallelism in GPUs to generate appropriate cache reference sequences. Further, reuse distance analysis is extended to model multi-partition/multi-port parallel caches and is employed by RDGC to analyze GPU cache memories. RDGC can be utilized for architectural space exploration and parallel application development by providing hit ratios and transaction counts. The results of the present study demonstrate that the proposed model has average errors of 3.72% and 4.5% for L1 and L2 hit ratios, respectively. The results also indicate that RDGC incurs a slowdown of 47,000× compared to hardware execution, while it is 59× faster than the GPGPU-Sim simulator. © 2019 Slovak Academy of Sciences. All rights reserved.
Journal of Circuits, Systems and Computers (17936454), 28(14)
Modern GPUs can execute multiple kernels concurrently to keep the hardware resources busy and to boost overall performance. This approach is called simultaneous multiple kernel execution (MKE). MKE is a promising approach for improving GPU hardware utilization. Although modern GPUs allow MKE, the effects of different MKE scenarios have not been adequately studied. Since cache memories have significant effects on overall GPU performance, the effects of MKE on cache performance should be investigated properly. The present study proposes a framework, called RDMKE (short for Reuse Distance-based profiling in MKEs), that provides a method for analyzing GPU cache memory performance in MKE scenarios. The raw memory-access information of a kernel is first extracted, and RDMKE then enforces an ordering on the memory accesses so that the sequence represents a given MKE scenario. Afterward, RDMKE employs reuse distance analysis (RDA) to generate cache-related performance metrics, including hit ratios, transaction counts, and cache-set and miss status holding register reservation failures. In addition, RDMKE provides the user with the RD profiles as a useful locality metric. The simulation results of single-kernel executions show a fair correlation between the results generated by RDMKE and GPU performance counters. Further, the simulation results of 28 two-kernel executions indicate that RDMKE can properly capture the nonlinear cache behaviors in MKE scenarios. © 2019 World Scientific Publishing Company.
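A hypothetical sketch of the kind of ordering step described above (the paper's actual ordering model is more detailed than this): interleave two kernels' memory-access streams round-robin to emulate their concurrent arrival at a shared cache, producing a merged trace that an RDA routine can then consume.

```python
from itertools import zip_longest

def interleave(trace_a, trace_b):
    """Round-robin merge of two memory-access traces; the shorter
    trace simply runs out and the longer one continues alone."""
    merged = []
    for a, b in zip_longest(trace_a, trace_b):
        if a is not None:
            merged.append(a)
        if b is not None:
            merged.append(b)
    return merged
```

Running reuse distance analysis on the merged trace, rather than on each kernel alone, is what exposes the inter-kernel cache contention that single-kernel profiles cannot show.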
Modern GPUs employ simultaneous kernel execution (SKE), an equivalent to multitasking in CPUs, to maximize hardware utilization and enhance the resulting performance. The SKE paradigm has not yet been fully explored by the research community. In this study, a reuse-distance (RD) based analysis approach, called SKERD, is proposed to analyze the effect of SKE scenarios on kernel data reuse and GPU cache memory performance. Only two simultaneous kernels were considered in this work. Moreover, three types of coarse-grained SM (streaming multiprocessor) partitioning schemes were investigated: an even SM-to-kernel partitioning and two schemes that assign SMs to kernels based on the kernel workloads. The simulation results show that none of the mentioned partitioning schemes always functions better than the others. Further, for some memory-intensive kernels, SKE resulted in cache contention and hit-ratio degradation. Consequently, the effects of SKE on cache memories should be carefully considered. © 2017 IEEE.
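The two partitioning ideas above can be sketched as follows. This is an assumed illustration (function names and the scalar "workload" inputs are hypothetical, and the paper investigates two distinct workload-based schemes, not one): an even split of SMs between the two kernels versus a split proportional to each kernel's workload.

```python
def even_partition(num_sms):
    """Split the SMs evenly between two kernels."""
    half = num_sms // 2
    return half, num_sms - half

def proportional_partition(num_sms, work_a, work_b):
    """Assign SMs in proportion to kernel workloads,
    keeping at least one SM per kernel."""
    sms_a = round(num_sms * work_a / (work_a + work_b))
    sms_a = min(max(sms_a, 1), num_sms - 1)
    return sms_a, num_sms - sms_a
```

The finding that neither scheme always wins suggests the best split depends on how cache-sensitive each co-running kernel is, not only on raw workload.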
Performance modeling plays an important role in optimal hardware design and optimized application implementation. This paper presents a very low-overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source-code-level information to estimate per-memory-access-instruction, per-data-array, and per-kernel localities within GPU kernels. VLAG is only applicable to kernels with regular memory access patterns. VLAG was experimentally evaluated using an NVIDIA Maxwell GPU. For two different matrix multiplication kernels, average errors of 7.68% and 6.29% resulted, respectively. The slowdown of VLAG for matrix multiplication was measured at 1.4X, which, compared with approaches such as trace-driven simulation, is negligible. © 2017 IEEE.