GPGPU Performance Estimation with Core and Memory Frequency Scaling

Qiang Wang, Xiaowen Chu
Department of Computer Science, Hong Kong Baptist University
{qiangwang, chxw}@comp.hkbu.edu.hk

Abstract—Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) in order to balance computational performance and energy consumption. However, there is still no simple and accurate way to estimate the performance of a given GPU kernel under different frequency settings on real hardware, which is important for deciding the best frequency configuration for energy saving. This paper presents a fine-grained model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Over a 2.5x range of both core and memory frequencies among 12 GPU kernels, our model achieves accurate results (within 3.5%) on real hardware. Compared with cycle-level simulators, our model only needs some simple micro-benchmarks to extract a set of hardware parameters, together with the performance counters of the kernels, to produce this high accuracy.

Index Terms—Graphics Processing Units; Dynamic Voltage and Frequency Scaling; GPU Performance Modeling

I. INTRODUCTION

Recently Graphics Processing Units (GPUs) have become widely used, from Deep Learning (DL) workstations to high-performance supercomputer centers. In particular, most popular DL toolkits [1], [2], [3], [4], [5] heavily rely on the remarkable computational power of GPUs. However, due to the rapidly increasing computational requirements of both DL toolkits and other GPU applications on large amounts of data, the total energy consumption can be very high, which not only results in high electricity budgets but also violates green computing. For example, the supercomputer Titan, accelerated with NVIDIA K20x GPUs, requires a power supply of 8.21 million Watts with an electricity cost of about 23 million dollars per year [6]. Even a 5% reduction of the power consumption can save up to 1 million dollars of electricity costs. Effective energy saving techniques urgently need to be designed for GPUs.

Energy conservation techniques on modern computers are generally based on Dynamic Voltage and Frequency Scaling (DVFS). Nowadays GPUs usually support simple automatic voltage and frequency adjustment in order to save power and protect the hardware. Nevertheless, GPUs hardly attain their best energy efficiency under the default voltage and frequency settings [7], [8] and still have potential for energy conservation. To find the most energy-efficient DVFS configurations, the energy consumption under different DVFS settings should be predicted, which requires modeling both the performance and the runtime power of GPUs under various settings of voltage and frequency [9]. In this paper we address the performance modeling problem.

There are three main challenges in GPU performance prediction under different core and memory frequencies. First, compared to traditional CPUs, GPUs have a much more complex memory hierarchy, of which GPU vendors reveal few details. Second, GPUs have two independent frequencies, belonging to the core and the memory respectively, and they affect different components of the GPU. Third, resource contention is heavy due to the large number of concurrent threads.

Some previous state-of-the-art work proposes analytical pipeline GPU performance models [10], [11], [12], [13], [14], [15], which emphasize the relationship between compute cycles and memory latency. However, there still exist some opportunities to reinforce these models. First, the L2 cache has become larger and larger across GPU generations: compared to Fermi (2011), Maxwell (2014) has a four times larger L2 cache. A larger cache generally increases the cache hit rate, which reminds us to pay more attention to L2 cache latency and throughput instead of only DRAM. Second, most of the previous models only work under the default frequency settings of the GPU, and the kernel behavior can change significantly when the core and memory frequencies are adjusted.

Simulation methods [16], [17], [18] expose sufficient details to help understand the execution of GPU kernels.
The 0 thetotalenergyconsumptioncanbeveryhigh,whichnotonly best available simulator to date [19] combines performance 7 resultsinhighelectricitybudgetsbutalsoviolatesgreencom- counters and specific hardware information to estimate the 1 puting.Forexample,thesupercomputerTitanacceleratedwith kernelexecutiontimewithhighaccuracy.However,compared : v the NVIDIA K20x requires a power supply of 8.21 million with the fast evolving GPU generations, such simulators still i X Watts with an electricity cost of about 23 million dollars per stand for the earlier GPU architecture like Fermi, which year [6]. Even decreasing 5% of the power consumption can does not meet the great changes of newer hardware. Besides, r a reduce up to 1 million dollars of electricity costs. Effective these simulators usually consume much longer time than real energy saving techniques are emergent to be designed for hardware,whicharedifficulttobeappliedtoreal-timepower- GPUs. performance optimization. Energy conservation techniques on modern computers are RecentGPUperformancemodels[20],[21],[22],[23],[24] generally based on Dynamic Voltage and Frequency Scaling also witness the trend of Machine Learning methods such as (DVFS). Nowadays GPUs usually support simple automatic K-means, multiple linear regression and neural network, and voltage and frequency adjustment in order to save power and obtainconsiderableaccuracy.However,fewofthemintroduce protectthehardware.Nevertheless,GPUshardlygainthebest frequency scaling as impact factors in their models. Besides, energy efficiency under the default voltage and frequency set- their works strongly rely on training data such as specific tings [7], [8] and still have potentials of energy conservation. performancecountersandkernelsettings.Eventheycanreveal To find the most energy efficient DVFS configurations, the somecorrelationsbetweentheinputparametersandexecution energy consumption under different DVFS settings should time, it needs further explorations of how they interact with be predicted, which requires modeling both performance and each other and contribute to the final time prediction. runtime power of GPUs under various settings of voltage and We believe that a fast and accurate GPU performance model is a key ingredient for energy conservation with DVFS Stream Processors * 16 technique and it should be applicable to real hardware. In this paper, we first attempt to model the memory system of GPU Instruction Cache Core Texture/L1 Cache with a FCFS (First-come-first-serve) queue in which service Instruction Buffer * 4 SFU Regs Shared Memory rate depends on the memory frequency. Based on that, we Warp Scheduler * 4 LD/ST Instruction Cache propose a GPGPU performance estimation model considering 32 32 128 both core and memory frequency scaling. Our paper reveals * * * L2 Cache SFU LD/ST following contributions: Cores Units Units 1) We model the memory system of GPU with a simple Texture/L1 Cache Memory Controller Shared Memory queue related with the frequency. DRAM DRAM DRAM 2) WeestablishananalyticalGPUperformancemodelwith both core and memory frequency scaling. Fig. 1: The block diagram of NVIDIA GTX980 GPU board. 
3) On real GPU hardware, our performance model achieves 3.5% MAPE (Mean Absolute Percentage Error) across 49 frequency settings with up to 2.5x scaling among 12 kernels. Meanwhile, we achieve 0.7% to 6.9% MAPE for each single kernel, which suggests high accuracy and low variance of our performance model.

The rest of this paper is organized as follows. Section II introduces some basic knowledge about GPUs and DVFS techniques, followed by a motivating example about performance scaling behaviors with different frequency settings. Section III lists some related work. Section IV details our memory queueing model for GPUs with frequency scaling, and based on it, Section V proposes our GPGPU performance estimation model with both core and memory frequency scaling. Section VI describes our experimental setup and presents the experimental results. Finally we state our conclusion and future work in Section VII.

II. BACKGROUND AND MOTIVATION

A. GPU Architecture

Over the past five years NVIDIA has released five generations of GPUs. The new functions and improvements of each updated version can be obtained from [25]. Despite some differences in hardware configuration such as core count and memory size, the basic chip framework is almost the same. Fig. 1 shows a brief block diagram of the Maxwell GTX980 GPU; the structural details can be found in its official white paper. Note that GPUs have a complicated memory hierarchy including dynamic random-access memory (DRAM), L2 cache, shared memory, texture/L1 cache and registers. Different memory types have their own characteristics in terms of latency, bandwidth and access scope, which makes it difficult to predict the execution time of a GPU kernel.

B. Dynamic Voltage and Frequency Scaling

DVFS is one of the most typical energy conservation techniques for traditional CPUs. The dynamic power is usually estimated by Equation (1) [26], where a is the switching activity factor, C the load capacitance, V the supply voltage and f the clock frequency.

P_dynamic = a C V^2 f   (1)

Since the total energy consumption of an application is obtained by multiplying the average runtime power by the execution time, performance modeling plays an important role in energy consumption prediction under different DVFS settings. For traditional CPUs, scaling up the frequency is usually a good option to save energy [27]. However, some previous GPU DVFS work indicates that GPUs have more complex energy scaling behaviors when adopting different voltages and frequencies, and scaling up the frequency does not always help reduce the energy consumption [7], [8].

Modern GPUs have two main frequency domains. One is the core frequency, which mainly controls the speed of the streaming multiprocessors (SMs), while the other is the memory frequency, which affects the bandwidth of DRAM. Table I summarizes the dominating frequency for different types of memory. Note that only DRAM works under the memory frequency; the L2 cache works under the core frequency, although both serve global memory requests.

TABLE I: DOMINATING FREQUENCY FOR DIFFERENT COMPONENTS
Component        Dominating Frequency
DRAM             memory frequency
L2 Cache         core frequency
Shared Memory    core frequency
Texture Cache    core frequency
Register         core frequency

C. Performance Scaling Behaviors with Frequency

As different GPU applications may have various utilizations of different hardware components, changing the frequencies may lead to diverse performance scaling behaviors. As a motivating example, we test a set of frequency pairs on six GPU kernels to observe how the execution time changes. We first fix the core frequency to 400 MHz and 1000 MHz respectively and scale the memory frequency from 400 MHz to 1000 MHz with a step size of 100 MHz.
As illustrated by Fig. 2(a) and 2(b), some kernels like transpose (TR), blackScholes (BS), vectorAdd (VA) and convolutionSeparable (convS) achieve almost 2.5x speedup when the memory frequency is increased by 2.5x, while the two matrix multiplication kernels, with global memory (MMG) and with shared memory (MMS), show negligible speedup. Another interesting finding is that the two matrix multiplication kernels MMG and MMS exhibit different scaling behaviors under different core frequencies: a higher core frequency allows them to gain a higher speedup when the memory frequency increases. The possible reason is that the performance is restricted by the core frequency when the core frequency is low, while it is restricted by the memory frequency when the core frequency is high enough to drive the computational power.

Then we fix the memory frequency to 400 MHz and 1000 MHz respectively, and scale the core frequency from 400 MHz to 1000 MHz. Fig. 2(c) and 2(d) show that the core frequency has little effect on the performance of TR, BS and VA but a great impact on the other three. It is also observed that the performance can be limited by different frequency domains under different frequency settings.

As a result, under different frequency settings, the performance scaling behaviors can be diverse and complicated among different GPU kernels. Our goal is to establish an estimation model that can predict the execution time of a given GPU kernel under different core and memory frequency settings. To achieve this, we first explore how the core and memory frequencies affect different types of memory, including DRAM, L2 cache and shared memory. That gives us a quantitative model to estimate the different memory latencies. Then we use profiling tools to extract some performance counters from running the kernel under the baseline frequency setting. Combined with the memory model and the profiling data, the kernel execution time can be estimated under other frequency settings.

Fig. 2: Performance scaling behavior under different frequency settings. The upper two subfigures show the speedup of different GPU kernels when increasing the memory frequency with a fixed core frequency ((a) Fcore = 400 MHz, (b) Fcore = 1000 MHz); the lower two show the speedup when increasing the core frequency with a fixed memory frequency ((c) Fmem = 400 MHz, (d) Fmem = 1000 MHz).

III. RELATED WORK

To derive an accurate performance model of GPUs, it is quite important to understand their complex memory hierarchy. Henry Wong et al. [28] developed a micro-benchmark suite and measured characteristics such as the cache structure and latency of various memory types, TLB parameters, and the latency and throughput of arithmetic and logic operations. Meltzer [29] extended similar work on the Tesla C2070. In addition, Xinxin Mei et al. [30] conducted a similar study but focused more on the memory hierarchy. They proposed a fine-grained P-chase method to explore cache parameters with the uncommon structures and replacement policies that appear in the latest generations of GPUs (such as Kepler and Maxwell). However, such methods usually test a single kernel with only a few threads executing one type of instruction. When thousands of threads access memory simultaneously, which happens quite often in GPU applications, the memory bandwidth might not satisfy the demands and some operations will be stalled in the queue of the memory controller (MC). Such cases lead to high variance in memory access latency.

Hong et al. [10], [11] proposed an analytical model that estimates the different degrees of memory parallelism and computation parallelism with some offline information about the kernel program. Furthermore, Sim et al. [14] improved the above MWP-CWP model by considering cache effects, SFU characteristics and instruction throughput. However, their methods ignore the effects of shared memory latency and DRAM memory latency divergence, which may introduce significant bias for some memory-bound applications. Song et al. [22] extended these models and addressed more types of memory access by collecting some simple counters. However,
the model averages the cache effects among all the warps and potentially ignores memory latency divergence in some asymmetric applications.

Nath et al. [12] present the CRISP model, which analyzes performance under different compute core frequencies. They point out that DVFS on GPUs is different from that on CPUs, since computation operations and memory transactions from different threads can overlap most of the time. Based on the characteristics of GPU performance with varying frequencies found from experiments, they classify different execution stages in the kernel program and compute them with various frequencies. However, CRISP only works for the case of either scaling down the core frequency or scaling up the memory frequency. Also, the model may become more complicated if the memory frequency is considered.

Gene Wu et al. [20] built a performance model based on different patterns of scaling with various core and memory frequencies. They first adopted K-means to cluster the scaling behavior patterns of 37 kernels and then explored the relationship between performance counters and clustering patterns with ANN modeling. With the model trained on a large amount of data, one can predict the performance of a kernel under any setting of core and memory frequency with the predicted scaling pattern.

IV. MEMORY MODELING WITH FREQUENCY SCALING

In previous performance modeling work, memory latency is usually set as a constant parameter obtained by micro-benchmarking. However, since the DRAM in a GPU can be accessed by any thread running on any SM, the memory latency of each thread may vary due to intensive memory transactions. Besides, memory latency can also change with different frequency settings. In this section, we first model the memory latency with a simple queueing model. Then we measure the parameters used in the model under different frequency settings for further performance modeling.
A. DRAM Memory Latency

When one warp launches a memory request to the global memory, it usually takes about hundreds of cycles to go through DRAM if the data is not cached. This minimum latency happens when the memory system is idle, and it only contains the overhead of path traveling and data access. Fig. 3 shows this case: the inter-arrival interval between two consecutive memory requests is shorter than the time needed to load data from DRAM, but the requests are infrequent enough that each memory request only costs the minimum latency. To compute the total latency of finishing all the memory requests in this case, we only need to know how many memory requests are executed by each warp. As inferred from Fig. 3, the total time consumption T_lat of all the memory requests can be calculated by Equation (2), where interArr denotes the inter-arrival time of two consecutive memory requests, #W denotes the total warp number, dm_lat denotes the minimum memory latency with no memory contention, and gld_trans denotes the number of global memory transactions of one warp.

T_lat = interArr × #W + dm_lat × gld_trans   (2)

Fig. 3: Execution time pipeline of infrequent DRAM requests. The number on each block indicates the iteration of one warp.

If the memory system is saturated due to intensive memory requests, the minimum latency can hardly be achieved. Most requests have to wait in the queue until the previous ones have been finished. In Fig. 4, each memory request is launched after a very short interval, so that each memory request pays not only the minimum latency but also the queueing delay, i.e., the waiting time in the queue. Thus, intensive memory access demands can lead to diverse memory latencies. In this case, we can calculate the total time consumption by Equation (3), where dm_del denotes the service time of one memory transaction.

T_lat = dm_lat + dm_del × gld_trans × #W   (3)

Fig. 4: Execution time pipeline of intensive DRAM requests. The number on each block indicates the iteration of one warp.

We also observe this phenomenon in experiments. We revise the global memory bandwidth benchmark code of [30] and add the clock measuring function clock() to collect memory latency samples. To reduce the overhead of clock() and of recording the timestamps in global memory as much as possible, we sample only one request for some threads. Fig. 5 shows our experimental results. We can infer two things from the results. First, the memory latency can be diverse because of the intensive requests. Second, the memory latency is roughly linearly correlated with the warp number, which matches the model in Fig. 4.

Fig. 5: Experimental results of memory access latency. (a) Start/end timestamps (cycles) of each warp, sorted by starting time in ascending order. (b) Memory access latency (cycles) of each warp, re-ordered ascendingly.

We would then like to explore how frequency scaling affects dm_lat and dm_del. We first utilize the global memory latency benchmark code of [30] to measure dm_lat under different memory frequencies. For simplicity, we only measure the latency in the case that the TLB cache is hit. Table II shows part of our results; the cycle is the time unit under the GPU core frequency. We also measure dm_lat with other frequency combinations and find that it can be fitted by Equation (4) with an R-squared of 0.9959.

dm_lat = 222.78 × core_f / mem_f + 277.32   (4)
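To make the memory model concrete, the following Python sketch evaluates Equations (2)-(4). It is an illustrative implementation under the paper's definitions, not the authors' code; the kernel parameters used in the usage example are hypothetical.

```python
def dm_lat_cycles(core_f, mem_f):
    """Minimum DRAM latency in core cycles, fitted by Equation (4)."""
    return 222.78 * core_f / mem_f + 277.32

def t_lat_infrequent(inter_arr, num_warps, dm_lat, gld_trans):
    """Equation (2): total latency when the memory system is not saturated."""
    return inter_arr * num_warps + dm_lat * gld_trans

def t_lat_saturated(dm_lat, dm_del, gld_trans, num_warps):
    """Equation (3): total latency when requests queue up in the memory controller."""
    return dm_lat + dm_del * gld_trans * num_warps

if __name__ == "__main__":
    core_f, mem_f = 1000.0, 400.0           # hypothetical frequency pair (MHz)
    dm_lat = dm_lat_cycles(core_f, mem_f)   # about 834 core cycles at this ratio
    # hypothetical kernel: 64 warps, 8 global transactions per warp,
    # 10-cycle DRAM service time, 20-cycle inter-arrival time
    print(t_lat_infrequent(20, 64, dm_lat, 8))
    print(t_lat_saturated(dm_lat, 10, 8, 64))
```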
As for the DRAM delay measurement, we again use the global memory bandwidth benchmark code of [30] to achieve as high a DRAM bandwidth as possible. Table III shows part of our results. To calculate dm_del in cycles, we convert the total execution time into cycles by multiplying the time by the core frequency, and count the number of memory transactions issued by all warps. Then we can infer dm_del from Equation (3) with the obtained dm_lat. We observe that the DRAM delay is also correlated with the bandwidth efficiency, i.e., the percentage of DRAM bandwidth utilization. As the memory frequency increases, dm_del (in cycles) becomes smaller and the bandwidth efficiency becomes larger, which suggests that a high memory frequency helps improve the utilization of DRAM bandwidth.

TABLE II: MINIMUM DRAM LATENCY UNDER DIFFERENT MEMORY FREQUENCIES
Memory Freq./MHz   Core Freq./MHz   Cycles
400                400              500
500                500              455.5
600                600              425.8
700                700              404.6
800                800              388.7
900                900              376.3
1000               1000             366.4

TABLE III: DRAM READ DELAY UNDER DIFFERENT MEMORY FREQUENCIES
Memory Freq./MHz   Core Freq./MHz   Cycles   Bandwidth Efficiency
400                400              10.06    76%
500                500              9.76     78.13%
600                600              9.54     79.8%
700                700              9.31     81.83%
800                800              9.19     83.42%
900                900              9.06     84.51%
1000               1000             9.0      85%

B. L2 Cache

With the development of GPU generations, the L2 cache has become larger and larger (e.g., from 512 KB for the Fermi GTX560 Ti to 2 MB for the Maxwell GTX980) in order to reduce the pressure on the memory system. As mentioned before, different cache hit rates may bring different sensitivities to frequency scaling. Similar to the DRAM measurement experiments, we use the same global memory latency code to obtain the L2 cache latency under different frequency settings. We observe that the latency always lies within 220 to 224 cycles. This is reasonable because the L2 cache is driven by the core frequency. Thus, we take the average of 222 cycles as the L2 minimum latency. Besides, we choose 1 cycle for l2_del because, for the same reason, the L2 cache can return a memory request every core cycle.

C. Adjustment with Frequency Scaling

For simplicity, our performance model defines a baseline frequency setting in which the ratio of the memory frequency to the core frequency is one. We measure the basic latency and throughput of all the components, including core computation, shared memory access, constant memory, L2 cache and DRAM, under this baseline setting. We can then use the standard Average Memory Access Time (AMAT) [31] model to obtain the average global memory access latency agl_lat and the average queueing delay agl_del of all the global memory transactions issued during kernel execution with Equations (5a) and (5b). l2_hr denotes the L2 cache hit rate of the kernel; core_f and mem_f denote the core and memory frequencies respectively.

agl_lat = l2_lat × l2_hr + dm_lat × (core_f / mem_f) × (1 − l2_hr)   (5a)
agl_del = l2_del × l2_hr + dm_del × (core_f / mem_f) × (1 − l2_hr)   (5b)

Since we calculate the execution time in the scope of the core frequency, no extra adjustment is needed for the latency and throughput inside the SM.
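Equations (5a) and (5b) translate directly into code. The sketch below assumes that dm_lat and dm_del are the baseline-measured values (core_f = mem_f) in core cycles, which the equations then rescale by core_f/mem_f; this is our reading of Section IV-C, and the 222-cycle/1-cycle L2 constants are the measurements reported above. It is illustrative only.

```python
L2_LAT = 222.0   # L2 minimum latency in core cycles (measured above)
L2_DEL = 1.0     # the L2 cache can return one request per core cycle

def agl_lat(l2_hr, dm_lat, core_f, mem_f):
    """Equation (5a): average global memory latency, weighted by the L2 hit rate."""
    return L2_LAT * l2_hr + dm_lat * (core_f / mem_f) * (1.0 - l2_hr)

def agl_del(l2_hr, dm_del, core_f, mem_f):
    """Equation (5b): average queueing delay per global memory transaction."""
    return L2_DEL * l2_hr + dm_del * (core_f / mem_f) * (1.0 - l2_hr)

# Example: a kernel with a 97.5% L2 hit rate stays close to the L2 latency,
# and is therefore mostly sensitive to the core frequency.
print(agl_lat(0.975, dm_lat=366.4, core_f=1000, mem_f=1000))  # ~225.6 cycles
```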
V. GPU PERFORMANCE MODELING WITH FREQUENCY SCALING

The SMs in a GPU execute threads in groups of 32 parallel threads called warps [25]. Generally a GPU launches a large number of warps during the whole kernel execution period. However, due to hardware resource limitations, one SM can only execute a certain number of warps concurrently, called active warps and denoted by #Aw. Once we obtain the time consumption of one round of active warps, denoted by T_active, the total execution time of a kernel can be estimated by Equation (6), where #B denotes the total number of thread blocks, #Wpb denotes the number of warps per block, and #SM denotes the number of SMs.

T_exec = T_active × (#Wpb × #B) / (#Aw × #SM)   (6)

Basically, a GPU kernel can be divided into several segments. Some segments do not access shared memory, while others do. Since they are influenced by different frequencies and usually have different working patterns, we classify the GPU kernel segments into two categories according to whether shared memory transactions happen during the kernel execution. Some kernels also utilize the texture/L1 cache for further performance improvement, which somewhat affects the accuracy of our model; we leave this as future improvement work for our current model.

A. Performance Modeling without Shared Memory

The first case of our model is that the GPU kernel does not utilize shared memory. In this case, the kernel only contains computation in the SMs and global memory transactions. As described in Section IV, we can estimate the time consumption of the global memory transactions. As for the computation part, we simply assume that the same amount of computation time occurs before each global memory transaction, as shown in Fig. 6. We divide the total number of compute instructions (denoted by comp_inst) by the total number of global memory transactions gld_trans to obtain the average compute instruction count (denoted by avr_inst). The average computation time (denoted by avr_comp) before each global memory transaction can then be estimated by Equations (7a) and (7b).

avr_inst = comp_inst / gld_trans   (7a)
avr_comp = inst_cycle × avr_inst   (7b)

If the kernel launches a large number of computation instructions and only a few memory requests that do not saturate the memory bandwidth, the computation period will dominate the total kernel execution time. On the other hand, if the memory access latency is much longer than the computation cycles due to intensive memory requests, the computation period can be hidden by the memory operations. The first case is regarded as a compute-dominated (compute-bound) kernel, while the second is memory-dominated (memory-bound).

Fig. 6: Execution time pipeline of a compute-dominated kernel. Since the kernel launches enough warps containing long compute cycles, most of the memory latency can be hidden.

Fig. 7: Execution time pipeline of a memory-intensive kernel. Since each warp issues few computation instructions but frequent memory access requests, one memory transaction cannot be processed until all outstanding transactions have been finished.
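A minimal sketch of Equations (6), (7a) and (7b), assuming the counters (comp_inst, gld_trans, #Aw and so on) have been collected with the profiler as described in Section VI; the numbers in the usage example are hypothetical.

```python
def avr_comp(comp_inst, gld_trans, inst_cycle):
    """Equations (7a)/(7b): average computation time before each global transaction."""
    avr_inst = comp_inst / gld_trans
    return inst_cycle * avr_inst

def t_exec(t_active, warps_per_block, num_blocks, active_warps, num_sms):
    """Equation (6): scale one round of active warps to the whole kernel."""
    rounds = (warps_per_block * num_blocks) / (active_warps * num_sms)
    return t_active * rounds

# Hypothetical kernel: 1024 blocks of 8 warps on 16 SMs with 64 active warps per SM.
print(t_exec(t_active=5.0e4, warps_per_block=8, num_blocks=1024,
             active_warps=64, num_sms=16))  # total kernel time in core cycles
```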
1) Compute-Dominated: When there are enough computation instructions to be issued and the memory requests are not too intensive due to the long computation period, the global memory latency can be hidden, as illustrated by Fig. 6. In this case, Equations (8a) and (8b) should be satisfied, and we can estimate T_active by Equation (9). o_itrs denotes the number of repetitions of one computation period plus one global memory transaction.

avr_comp ≥ agl_del   (8a)
avr_comp × (#Aw − 1) ≥ agl_lat   (8b)

T_active = avr_comp × #Aw × o_itrs + agl_lat   (9)

2) Memory-Dominated: When the memory bandwidth is saturated or there are not enough warps to issue computation instructions, a memory request has to wait until all the outstanding requests have been finished. Fig. 7 demonstrates this case. The condition is described by Equations (10a) and (10b). Similar to the case in Fig. 4, we can regard the compute cycles as the inter-arrival time of two consecutive memory requests. We can estimate T_active by Equation (11), focusing on the agl_del of each warp.

avr_comp ≤ agl_del   (10a)
(avr_comp + agl_lat) ≥ agl_del × (#Aw − 1)   (10b)

T_active = agl_lat + avr_comp + agl_del × #Wpb × o_itrs   (11)

When the kernel launches only a few warps, most of the latency cannot be hidden, which leads to insufficient utilization of the GPU. The memory latency may then contribute a lot to the execution time. There are two sub-cases, identified by whether avr_comp is shorter than agl_del. When avr_comp is shorter than agl_del, the first memory request of each warp has a waiting period, as Fig. 8 shows. This is described by Equations (12a) and (12b), and we can estimate T_active by Equation (13) for this case.

avr_comp ≤ agl_del   (12a)
(avr_comp + agl_lat) ≤ agl_del × (#Aw − 1)   (12b)

T_active = agl_del × #Aw + agl_lat + avr_comp + (avr_comp + agl_lat) × (o_itrs − 1)   (13)

Fig. 8: Execution time pipeline of a kernel containing few warps that have short computation periods. The first memory request of each warp has a waiting period while the rest do not.

When avr_comp is longer than agl_del, all the memory requests can be processed immediately once issued, as Fig. 9 shows. This is described by Equations (14a) and (14b), and we can estimate T_active by Equation (15).

avr_comp ≥ agl_del   (14a)
avr_comp × (#Aw − 1) ≤ agl_lat   (14b)

T_active = avr_comp × (#Aw − 1) + (avr_comp + agl_lat) × o_itrs   (15)

Fig. 9: Execution time pipeline of a kernel containing few warps that have long computation periods. Each memory request does not need to wait and can be processed immediately.
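The four cases above partition the space spanned by (avr_comp versus agl_del) and (many versus few active warps), so T_active can be selected mechanically. The sketch below is one way to encode Equations (8)-(15); it reflects our reading of the conditions, not code from the paper.

```python
def t_active_no_shared(avr_comp, agl_lat, agl_del, aw, wpb, o_itrs):
    """Select the T_active formula (Eq. 9, 11, 13 or 15) from the case conditions."""
    if avr_comp >= agl_del:
        if avr_comp * (aw - 1) >= agl_lat:                       # Eqs (8a)/(8b)
            return avr_comp * aw * o_itrs + agl_lat              # Eq (9), compute-dominated
        # Eqs (14a)/(14b): few warps with long computation periods
        return avr_comp * (aw - 1) + (avr_comp + agl_lat) * o_itrs   # Eq (15)
    if (avr_comp + agl_lat) >= agl_del * (aw - 1):               # Eqs (10a)/(10b)
        return agl_lat + avr_comp + agl_del * wpb * o_itrs       # Eq (11), memory-dominated
    # Eqs (12a)/(12b): few warps with short computation periods
    return (agl_del * aw + agl_lat + avr_comp
            + (avr_comp + agl_lat) * (o_itrs - 1))               # Eq (13)
```

Only the branch taken changes between the four regimes; the inputs are the same profiled counters and micro-benchmarked latencies summarized later in Table IV.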
At the final block even have not finished the first global memory beginning all the warps load data from global memory. Then transaction. in the second phase each warp access the shared memory for multiple times, which consumes much time. Since the total 1) Shared memory requests are infrequent: Some kernels shared memory latency of phase 2 is longer enough to hide may only have few iterations of shared memory requests. In global memory latency of other blocks, the global requests this case, since the latency of shared memory access is much within the same block have no contention with others. Due to lower than that of global memory access, the shared memory the function of synchronization, we can regard this procedure latency can often be hidden by global memory latency. Fig. as repetition of these two big steps. Thus, the total execution 10 shows this case. In the first phase, all the warps are time of one round can be calculated by Eq. (18)∼(21). i itrs launching global memory requests and storing the data into denotes the number of shared memory transactions within shared memory, which results in heavy traffic in DRAM. phase2. In the second phase, each warp only executes one shared T =avr comp×2+agl del×gld trans memory access consuming only a small number of cycles. phase1 (18) Then each warp writes the results back to global memory, ×#Aw+agl lat+sh lat which again launches quite a number of global memory T =avr comp×(warps per block−1) phase2 transactions.ThepatternissimilarwithFig.7exceptthatitis (19) +(avr comp+sh lat)×i itrs shared memory latency to be hidden. The condition is given byEquation(16a)and(16b)forthiscaseandwecanestimate T =avr comp×2+agl del×gld trans phase3 the execution time by Equation (17). shm lat denotes shared (20) ×#Wpb+agl lat+sh lat memory latency. Transpose with coalesced optimization is one instance of this case. Since the shared memory latency T =T +(T +T )×o itrs (21) active phase1 phase2 phase3 can be hidden, the kernel is not sensitive to core frequency Matrix multiplication with shared memory optimization is but memory frequency, which also meets the results of our one instance of this case. Its performance is sensitive to previous motivating examples. both core and memory frequency which is revealed in our previous motivating examples and it can be explained by our avr comp≤agl del (16a) model. First, T and T contain a large number phase1 phase3 (avr comp+shm lat)≤(agl del×(#Aw−#Wpb)) (16b) of global memory transactions which makes the execution time sensitive to memory frequency. Second, although shared memory latency is much shorter than global memory latency, Tactive =avr comp+agl lat nearly 3 dozens of shared memory requests in Tphase2 also +agl del×#Aw×gld trans (17) contribute a lot to the final Tactive, which makes the exe- cution time sensitive to core frequency as well. These two cases can be adopted to most classical GPU kernels. 
Though there are other, more complicated irregular instances such as MC_EstimatePiInlineP and reduction, a similar phase-partition methodology can be applied to them. We leave the detailed analysis of these irregular kernels for future work.

TABLE IV: PARAMETERS USED FOR PERFORMANCE MODELING
Parameter    Definition                                                               How to obtain
agl_lat      average latency of a global memory transaction considering the L2 cache hit rate   Equation (5a)
agl_del      average delay of a global memory transaction considering the L2 cache hit rate     Equation (5b)
dm_lat       DRAM latency of one global transaction                                   micro-benchmarking
dm_del       DRAM delay of one global transaction                                     micro-benchmarking
interArr     inter-arrival time between two consecutive memory requests               micro-benchmarking
l2_lat       L2 cache latency of one global transaction                               micro-benchmarking
l2_del       L2 cache delay of one global transaction                                 hardware specification
l2_hr        hit rate at the L2 cache for all transactions from the SMs               Nsight profiling
sh_lat       latency of one shared memory transaction                                 micro-benchmarking
gld_trans    number of global load/store transactions of one warp in one iteration    Nsight profiling
comp_inst    total compute instructions of the kernel                                 Nsight profiling
avr_comp     average computation time of one period                                   Equation (7b)
inst_cycle   latency of each computation instruction                                  hardware specification
#B           total number of blocks                                                   kernel setup
#Wpb         number of warps per block                                                kernel setup
#W           total number of warps of a kernel                                        kernel setup
#Asm         number of active SMs                                                     Nsight profiling
#Aw          number of warps run concurrently on one SM                               Nsight profiling
o_itrs       number of first-level iterations within a thread                         source code analysis
i_itrs       number of second-level iterations within a thread                        source code analysis
T_lat        total execution time of multiple global memory requests                  Equations (2), (3)
T_active     cycles for executing one round of active warps on an SM                  Equations (9), (11), (13), (15), (17), (21)
T_exec       total execution time of a kernel                                         Equation (6)
core_f       frequency that controls the speed of the SMs                             DVFS adjustment
mem_f        frequency that controls the speed of DRAM                                DVFS adjustment

VI. EXPERIMENTS

A. Experimental Methodology

With the help of NVIDIA Inspector [32], we can fix the performance state and adjust the core frequency and memory frequency of the GPU together within a certain range. By this method we can obtain execution time data for particular frequency combinations. We cover both the core and the memory frequency over a 2.5x scaling range from 400 MHz to 1000 MHz with a step size of 100 MHz, so that in total 49 frequency combinations are tested. We repeat our experiments 1000 times and report the average results.

TABLE V: TARGET GPU FREQUENCY CONFIGURATIONS
Device                     GTX980
Compute capability         5.2
SMs × cores per SM         16 × 128
Global memory bus width    256-bit
Global memory size         4 GB
Core frequency scaling     [400 MHz, 1000 MHz]
Memory frequency scaling   [400 MHz, 1000 MHz]
Scaling stride             100 MHz

We validate our model on 12 realistic GPU kernels from CUDA SDK 6.5, listed in Table VI, on a real NVIDIA Maxwell GTX980. The hardware specifications of our test machine are listed in Table V. These benchmark applications cover a wide range of execution patterns such as DRAM intensive, L2 cache intensive, shared memory intensive and computation intensive.

TABLE VI: TESTED APPLICATIONS
abbr.    Application name
BS       BlackScholes
CG       conjugateGradient
FWT      fastWalshTransform
MMG      matrixMul (global memory)
MMS      matrixMul (shared memory)
SC       scan
SN       sortingNetworks
SP       scalarProd
TR       transpose
VA       vectorAdd
convSp   convolutionSeparable

We use the NVIDIA Nsight tools [33] to extract the performance counters needed to drive our model at the baseline frequency of 700 MHz for both core and memory. Note that we only need a one-time data collection with this method, which makes our model fast to use. We choose 700 MHz as the baseline since it leaves room for both raising and lowering the frequency, which makes our performance model more general and flexible.
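Once the baseline parameters have been collected, the 7 × 7 sweep can be driven by a short script. The sketch below only illustrates the loop structure; predict_time and measured_time are hypothetical placeholders standing in for the model of Section V and the measured data averaged over the 1000 repetitions.

```python
import itertools

FREQS_MHZ = range(400, 1001, 100)   # 7 core x 7 memory settings = 49 pairs

def sweep(predict_time, measured_time, kernel):
    """Compare predicted and measured kernel time over all 49 frequency pairs."""
    errors = []
    for core_f, mem_f in itertools.product(FREQS_MHZ, FREQS_MHZ):
        pred = predict_time(kernel, core_f, mem_f)    # model of Section V
        meas = measured_time(kernel, core_f, mem_f)   # averaged measurement
        errors.append(abs(pred - meas) / meas)
    return 100.0 * sum(errors) / len(errors)          # MAPE in percent
```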
B. Experimental Results

First, we would like to observe the instruction distributions of the different GPU kernels with the help of the NVIDIA Nsight tools. As Fig. 12 demonstrates, our tested kernels have various mixes of instruction types, which is why we attempt to design a general model for different types of GPU kernels. In addition, such instruction statistics help us locate the principal contributors to the execution time under certain frequency settings. Combined with the experimental results, we can also infer some sources of under- or over-estimation of the execution time.

Fig. 12: Breakdown of different types of instructions (normalized proportions of gld_trans, L2_trans, shm_trans and comp_inst for each tested kernel).

Fig. 13: Time prediction error under different frequency settings. Each subfigure shows the results of scaling one of the frequencies from 400 MHz to 1000 MHz while the other is fixed (core_f = 400 MHz, core_f = 1000 MHz, mem_f = 400 MHz, mem_f = 1000 MHz).

Fig. 14: Mean absolute percentage error averaged across all available frequency pairs.

Fig. 13 shows the time prediction error under varying memory frequency with fixed core frequency, and vice versa. Across the 49 available frequency settings among the 12 kernels, we achieve below 16% error for every prediction, and 90% of the predictions are under 10% error. For each kernel, the mean absolute percentage error (MAPE) ranges from 0.7% to 6.9%, as shown in Fig. 14. We achieve 3.5% MAPE across all the testing samples. As mentioned before, some prediction errors can be explained by the instruction distribution of a kernel. For example, MatMul(S) has relatively larger under-estimation errors in 13(a) than in 13(b). The possible reason is that the time consumption in the SM, shared memory access in particular, is under-estimated. As for MatMul(G), although it launches a great number of global memory transactions, it has a high L2 cache hit rate of up to 97.5%, which makes it sensitive to the core frequency as well. Some kernels like convSp, FWT and SP have approximately linearly decreasing errors with larger memory frequency in 13(a) and 13(b) but stable errors in 13(c) and 13(d). This also reveals that these kernels are more sensitive to the memory frequency, which is supported by the fact that they have a high proportion of DRAM transactions.
VII. CONCLUSION

In this work, we demonstrated a GPGPU performance predictor for a wide range of both core and memory frequencies. We first estimate the total time consumption of multiple memory requests under different frequency settings. Then our predictor takes the profiling data of a given kernel under our baseline frequency setting as input to estimate its performance at other core and memory frequencies. Our model can predict the execution time of a GPU kernel on real hardware quickly and accurately, which is important for deriving real-time energy conservation suggestions with DVFS techniques.

We show that our performance estimation method achieves 3.5% MAPE across up to 2.5x scaling of both the core and the memory frequency. Our experimental results also indicate that our model captures the performance scaling behaviors not only of DRAM very precisely but also of the L2 cache and shared memory.

As for future work, we have two directions of improvement. First, our model does not treat shared memory as thoroughly as DRAM, and it does not take the texture/L1 cache and constant memory into account, which may introduce larger errors for kernels containing accesses to them. Second, combined with GPU power models, it is potentially a remarkable project to build a real-time voltage and frequency controller for GPUs based on energy conservation strategies with DVFS techniques.
REFERENCES

[1] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous systems, 2015," Software available from tensorflow.org, vol. 1, 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[4] X. Huang, "Microsoft Computational Network Toolkit offers most efficient distributed deep learning computational performance," https://goo.gl/9UUwVn, 2015, accessed: 2016-07-12.
[5] S. Shi, Q. Wang, P. Xu, and X. Chu, "Benchmarking state-of-the-art deep learning software tools," in Proceedings of the 7th International Conference on Cloud Computing and Big Data, IEEE, Macau, China, 2016.
[6] O. R. N. Laboratory, "Introducing Titan: advancing the era of accelerated computing," [Online] https://www.olcf.ornl.gov/titan/.
[7] X. Mei, Q. Wang, and X. Chu, "A Survey and Measurement Study of GPU DVFS on Energy Conservation," accepted by Digital Communication and Network (DCN). [Online]. Available: https://arxiv.org/abs/1610.01784
[8] R. A. Bridges, N. Imam, and T. M. Mintz, "Understanding GPU power: A survey of profiling, modeling, and simulation methods," ACM Computing Surveys (CSUR), vol. 49, no. 3, p. 41, 2016.
[9] X. Mei, X. Chu, H. Liu, Y.-W. Leung, and Z. Li, "Energy efficient real-time task scheduling on CPU-GPU hybrid clusters," in IEEE INFOCOM. IEEE, 2017.
[10] S. Hong and H. Kim, "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 152–163.
[11] ——, "An Integrated GPU Power and Performance Model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 280–289.
[12] R. Nath and D. Tullsen, "The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU," in Proceedings of the 48th International Symposium on Microarchitecture. ACM, 2015, pp. 281–293.
[13] R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt, "Predicting performance impact of DVFS for realistic memory systems," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2012, pp. 155–165.
[14] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, "A performance analysis framework for identifying potential benefits in GPGPU applications," in ACM SIGPLAN Notices, vol. 47, no. 8. ACM, 2012, pp. 11–22.
[15] X. Chen, Y. Wang, Y. Liang, Y. Xie, and H. Yang, "Run-time technique for simultaneous aging and power optimization in GPGPUs," in Design Automation Conference (DAC), 2014 51st ACM/EDAC/IEEE. IEEE, 2014, pp. 1–6.
[16] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 487–498.
[17] J. Lucas, S. Lal, M. Andersch, M. Alvarez-Mesa, and B. Juurlink, "How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 97–106.
[18] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009, pp. 163–174.
[19] T. M. Aamodt, W. W. Fung, I. Singh, A. El-Shafiey, J. Kwa, T. Hetherington, A. Gubran, A. Boktor, T. Rogers, A. Bakhoda et al., "GPGPU-Sim 3.x manual," 2012.
[20] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, "GPGPU performance and power estimation using machine learning," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 564–576.
[21] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, and M. Peres, "Power and performance characterization and modeling of GPU-accelerated systems," in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 113–122.
[22] S. Song, C. Su, B. Rountree, and K. W. Cameron, "A simplified and accurate model of power-performance efficiency on emergent GPU architectures," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, 2013, pp. 673–686.
[23] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in Green Computing Conference, 2010 International. IEEE, 2010, pp. 115–122.
[24] T. T. Dao, J. Kim, S. Seo, B. Egger, and J. Lee, "A performance model for GPUs with caches," Parallel and Distributed Systems, IEEE Transactions on, vol. 26, no. 7, pp. 1800–1813, 2015.
[25] NVIDIA, "CUDA C Programming Guide," [Online] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[26] V. Kursun and E. G. Friedman, "Supply and Threshold Voltage Scaling Techniques," Multi-Voltage CMOS Circuit Design, pp. 45–84, 2006.
[27] D. H. Kim, C. Imes, and H. Hoffmann, "Racing and Pacing to Idle: Theoretical and Empirical Analysis of Energy Optimization Heuristics," in Cyber-Physical Systems, Networks, and Applications (CPSNA), 2015 IEEE 3rd International Conference on. IEEE, 2015, pp. 78–85.
[28] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 235–246.