
Profiling GPU/CPU Performance in Windows, RHEL, OSX, iOS & Android PDF

327 Pages·2017·34.38 MB·English


Profiling GPU/CPU Performance in Windows, RHEL, OSX, iOS & Android

[Cover charts: "Geekbench Compute OpenCL and CUDA Scores" and "Geekbench CPU Scores"; the same charts appear as Figures 1 and 2A.]

August 16, 2017 2:59 am

1.0 Introduction

This succinct study illustrates the use of VTUNE CPU/GPU Concurrency & Hotspot Analyses, Nvidia System Management Interface and Visual Profiler, Visual Studio, the Intel Graphics Performance Analyzer, Xcode/Instruments, Android Studio and Snapdragon Profiler for recording and analyzing GPU and CPU performance statistics. The Geekbench OpenCL and CUDA benchmarks, the Nvidia VS Samples, Online Gaming (OVERWATCH), Adobe Premiere Pro, Deep Neural Networks (TensorFlow), and Google Virtual Reality are investigated in Windows 8.1, Windows 10, RHEL7, OSX 10.12.5, iOS 10.3.2, and Android 6.0.1 on Intel, Nvidia, Apple (Samsung) and Motorola (Qualcomm) GPUs.

http://davidjyoung.com/images/PUBLICATIONS.pdf

This document was produced using Word 2010, Excel 2010, Acrobat XI Pro, Visio 2003, and Structured FrameMaker 2017. The associated EDD is a modified version of the EDD presented in Chapter 2 of the Adobe FrameMaker 8 Structured Application Developer Guide, and is included as the final chapter in this book on page 94.

2.0 Geekbench

Pertinent Geekbench statistics follow.
[Chart: Geekbench Compute OpenCL and CUDA Scores. Series: A7-GPU-O, DROID-O, 4000-O, 4600-O, HD630-O, GTX860M-O, GTX860M-C, GTX1050-O, GTX1050-C. Categories: overall, Sobel, Histogram Equalization, SFFT, Gaussian Blur, Face Detection, RAW, Depth of Field, Particle Physics. Vertical axis: 0 to 500000.]

Figure 1 – Geekbench Compute Results

platform                ipad            droid turbo     mac mini        asus luggable   asus luggable   asus luggable   asus desktop    asus desktop    asus desktop
CPU model               A7              snapdragon 805  I7-3615QM       I7-4700         I7-4700         I7-4700         I7-7700         I7-7700         I7-7700
OS                      iOS 10.3.2      Android 6.0.1   OSX 10.12.5     Win 8.1         Win 8.1         Win 8.1         Win 10          Win 10          Win 10
CPU single core         1297            1061            3368            3620            3620            3620            5091            5091            5091
CPU multiple core       2198            2959            10632           11722           11722           11722           17224           17224           17224
GPU vendor              apple           qualcomm        intel           intel           nvidia          nvidia          intel           nvidia          nvidia
GPU model               a7gpu           adreno 420      4000            4600            GTX860M         GTX860M         HD630           GTX1050         GTX1050
geekbench workload      OpenCL          OpenCL          OpenCL          OpenCL          OpenCL          CUDA            OpenCL          OpenCL          CUDA
overall result          543             3860            6167            13809           45547           46903           22639           76263           82509
Sobel                   317             3412            3056            17219           92620           85611           29143           139765          134272
  Gpixels/sec           0.014           0.15            0.13            0.75            4.08            3.77            1.28            6.16            5.92
Histogram Equalization  326             4901            2409            13699           68766           54343           26091           106619          101866
  Gpixels/sec           0.01            0.153           0.075           0.43            2.15            1.7             0.82            3.33            3.18
SFFT                    36              743             2324            2520            8658            8689            3664            12589           12763
  Gflops                0.09            1.85            5.8             6.28            21.6            21.7            9.13            31.4            31.8
Gaussian Blur           249             13391           4271            15721           42874           64346           35318           76858           117422
  Gpixels/sec           0.004           0.234           0.074           0.275           0.751           1.13            0.6187          1.35            2.06
Face Detection          649             2820            2247            8020            10988           13937           13888           24030           23248
  Msubwindows/sec       0.189           0.823           0.656           2.34            3.21            4.07            4.06            7.02            6.79
RAW                     1726            4463            26458           48823           268237          160936          90633           441642          274802
  Gpixels/sec           0.017           0.043           0.256           0.47            2.6             1.56            0.877           4.27            2.66
Depth of Field          2172            11453           24064           24861           121874          99608           36113           201635          146756
  Mpixels/sec           6.31            33.3            69.9            72.2            354             289             104.9           586             426
Particle Physics        3391            2059            20017           14539           21810           40308           15248           37103           111741
  FPS                   536             326             3164            2298            3448            6372            2439            5865            17664

Figure 2 - Geekbench Compute (GPU) and CPU Results

[Chart: Geekbench CPU Scores. Series: CPU single core, CPU multiple core. Platforms: ipad, droid turbo, mac mini, asus luggable, asus desktop. Vertical axis: 0 to 20000.]

Figure 2A - Geekbench CPU results

Figure 3 shows GTX860M GPU utilization and temperature over time for the OpenCL workload. Figure 4 shows GTX860M GPU utilization and temperature over time for the CUDA workload. The graphs were created with data from the Nvidia System Management Interface. Intel4600 GPU and CPU busy, EU Stall, Watts, and percentage of TDP or MAX power are shown in Figure 5, created with data from the System Analyzer, one of the Intel Graphics Performance Analyzers.

[Charts: GTX860M OpenCL GPU Busy & Temperature; GTX860M Cuda GPU Busy & Temperature. Series: GPU Busy, Temp. Vertical axis: 0 to 100. Time axis: 09:14.8 to 10:01.2.]
Figures 3 & 4 – OpenCL, CUDA GPU Busy and Heat GTX860M

[Chart: Intel4600 OpenCL GPA - System Analyzer. Series: GPU Busy (%), Aggregated CPU Load (%), EU Stall (%), Package Power (Watts), Package Power/TDP (%). Vertical axis: 0 to 100.]

Figure 5 – OpenCL Intel4600 GPA System Analyzer Stats

3.0 NVSMI CLI

From the product documentation:

“nvidia-smi (also NVSMI) provides monitoring and management capabilities for each of NVIDIA's Tesla, Quadro, GRID and GeForce devices from Fermi and higher architecture families. GeForce Titan series devices are supported for most functions with very limited information provided for the remainder of the Geforce brand. NVSMI is a cross platform tool that supports all standard NVIDIA driver-supported Linux distros, as well as 64bit versions of Windows starting with Windows Server 2008 R2. Metrics can be consumed directly by users via stdout, or provided by file via CSV and XML formats for scripting purposes.”

https://developer.nvidia.com/nvidia-system-management-interface

Figures 6 and 7 illustrate the use of the NVSMI CLI to collect GPU stats and log the results to STDOUT and a .csv file.

Figure 6 – NVSMI CLI example: log to a file and stdout

    nvidia-smi.exe --query-gpu=timestamp,utilization.gpu,utilization.memory,temperature.gpu,memory.total,memory.free,pstate --format=csv -l 1 -f out.log

Figure 7 – NVSMI CLI example: log to a file

Figure 8 shows the location of the Nvidia GPU Utilization monitor. Figure 9 shows the execution of the OpenCL and CUDA workloads on the GTX860M.
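A log written by the Figure 7 command can be post-processed with a short script. The following is a minimal Python sketch, assuming the CSV header and the unit suffixes (`%`, `MiB`) look like the sample embedded in the code. The sample rows are invented for illustration and are not measurements from this study, and `parse_nvsmi_log` and `summarize` are hypothetical helper names. Fields that a particular GPU reports as unsupported would need extra handling.

```python
import csv
import io

# Hypothetical sample in the shape that --format=csv is assumed to produce.
# The values are made up for illustration, not taken from the study.
SAMPLE_LOG = """\
timestamp, utilization.gpu [%], utilization.memory [%], temperature.gpu, memory.total [MiB], memory.free [MiB], pstate
2017/08/16 02:59:14.800, 45 %, 12 %, 61, 2048 MiB, 1800 MiB, P0
2017/08/16 02:59:15.900, 98 %, 40 %, 67, 2048 MiB, 1650 MiB, P0
2017/08/16 02:59:16.900, 7 %, 2 %, 64, 2048 MiB, 1790 MiB, P8
"""

NUMERIC_FIELDS = (
    "utilization.gpu",
    "utilization.memory",
    "temperature.gpu",
    "memory.total",
    "memory.free",
)


def leading_number(cell):
    """Return the numeric part of a cell such as '45 %' or '2048 MiB'."""
    return float(cell.split()[0])


def parse_nvsmi_log(text):
    """Parse CSV log text into a list of dicts keyed by bare field name."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    # Drop unit suffixes: 'utilization.gpu [%]' becomes 'utilization.gpu'.
    names = [column.split(" [")[0].strip() for column in header]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        row = dict(zip(names, (cell.strip() for cell in record)))
        for key in NUMERIC_FIELDS:
            row[key] = leading_number(row[key])
        rows.append(row)
    return rows


def summarize(rows):
    """Return sample count, mean GPU utilization (%) and peak temperature."""
    busy = [row["utilization.gpu"] for row in rows]
    return {
        "samples": len(rows),
        "mean_gpu_busy": sum(busy) / len(busy),
        "max_temperature": max(row["temperature.gpu"] for row in rows),
    }


if __name__ == "__main__":
    print(summarize(parse_nvsmi_log(SAMPLE_LOG)))
```

To summarize a real capture, the text of `out.log` could be read from disk and passed to `parse_nvsmi_log` in place of `SAMPLE_LOG`.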
Figures 8 & 9 – NvGpuUtilization.exe, Geekbench OpenCL, CUDA

4.0 Intel Graphics Performance Analyzer

From the product documentation:

“System Analyzer - View CPU, GPU, and graphics API performance in real time and quickly identify where your game's performance needs improvement.
• Get metrics for CPU, GPU, graphics drivers, DirectX*, OpenGL*, or OpenGL* ES
• Experiment with override modes that quickly isolate common performance bottlenecks
• Capture frames and traces for further analysis
• Display up to 16 performance metrics simultaneously
• Monitor the current, minimum, and maximum frame rate
• Use without code modifications or special libraries”

https://software.intel.com/en-us/gpa

Figure 10 shows the Intel GPA System Analyzer: aggregated CPU load, GPU Busy, and EU Stall % are in the top window; power information is in the bottom window.

Figure 10 – Intel GPA – Systems Analyzer

Click on the CSV icon to start recording performance data to a .csv file, stored in the “ThisPC\Documents\GPA\” folder.

5.0 VTUNE CPU/GPU Concurrency Analysis

From the product documentation:

“CPU/GPU Concurrency analysis type is intended for platform analysis of applications that use a Graphics Processing Unit (GPU) for rendering, video processing, and computations. Use this analysis type as a starting point to understand the code execution on the various CPU and GPU cores in your system and identify whether your target application is GPU or CPU bound. The tool infrastructure automatically aligns clocks across all cores in the entire system so that you can analyze some CPU-based workloads together with GPU-based workloads within a unified time domain.
Use the CPU/GPU Concurrency analysis to:
• Explore GPU usage and analyze a software queue for GPU engines at each moment of time
• Correlate CPU and GPU activity and identify whether your application is GPU or CPU bound
• Identify GPU and CPU application frame rate and how it depends on vertical synchronization
• Explore the performance of your application per selected GPU metrics over time
• Analyze execution of Intel Media SDK tasks over time (for Linux targets only)
• Explore your application performance for user tasks created with Intel ITT API”

https://software.intel.com/en-us/amplifier_help_windows

Figure 11 highlights pertinent information gleaned from the CPU/GPU Concurrency Analyses. Detailed VTUNE statistics are in Appendix A.

GPU model                 Intel4600     IntelHDG630   GTX860M       GTX860M       GTX1050       GTX1050
geekbench workload        OpenCL        OpenCL        CUDA          OpenCL        CUDA          OpenCL
elapsed                   62.6          68.3          53.0          43.7          43.6          37.7
total CPU seconds         90.1          89.0          114.3         88.1          48.0          38.6
instructions (billions)   369           234           574           430           251           163
CPI                       0.709         1.005         0.641         0.661         0.669         0.78
total threads             711           1172          712           607           730           641
frame count               4694          970           3785          3240          581           536
total GB/sec              4.56          5.05          2.33          2.40          3.29          3.36
read GB/sec               3.02          3.58          1.43          1.55          2.32          2.45
write GB/sec              1.54          1.47          0.90          0.85          0.98          0.91
CPU Busy                  17.1          16.3          26.9          25.1          13.7          12.8
GPU Busy                  48.5          22.3          36.0          22.3          28.7          16.4
CPU & GPU active          30.76         15.88         19.40         32.50         0.01          0.60
CPU only active           31.90         52.42         33.60         0.00          0.02          0.91
GPU only active           0.00          0.00          0.00          0.00          12.81         7.00
CPU & GPU idle            0.00          0.00          0.00          0.00          30.77         29.20
CPU active                62.66         68.30         53.00         43.80         0.03          1.51
GPU active                30.73         15.88         19.40         11.20         12.82         7.60
Geekbench CPU             13.06         8.72          28.80         18.90         19.56         12.99
Geekbench GPU             30.18         15.14         19.10         10.90         12.52         7.28

Figure 11 – VTUNE CPU/GPU Concurrency Highlights

Figures 12 and 13 list the top CPU consumers for the Nvidia GPU workloads. Figure 14 lists the top CPU and GPU consumers for the Intel4600 OpenCL workload obtained via the VTUNE GPU Hotspots analysis with Stacks.
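Two quick cross-checks help when reading Figure 11. The Python sketch below is illustrative only: it multiplies the instruction count by CPI (cycles per instruction) to get the implied cycle count, and it checks that the read and write bandwidth rows add up to the total row. The helper names `cycles_billions` and `bandwidth_gap` are hypothetical, and the 0.011 tolerance is an assumption chosen to absorb two-decimal rounding in the table.

```python
# Rows copied from Figure 11, one entry per measured configuration:
# (instructions in billions, CPI, read GB/sec, write GB/sec, total GB/sec)
FIGURE_11 = {
    "Intel4600 OpenCL": (369, 0.709, 3.02, 1.54, 4.56),
    "IntelHDG630 OpenCL": (234, 1.005, 3.58, 1.47, 5.05),
    "GTX860M CUDA": (574, 0.641, 1.43, 0.90, 2.33),
    "GTX860M OpenCL": (430, 0.661, 1.55, 0.85, 2.40),
    "GTX1050 CUDA": (251, 0.669, 2.32, 0.98, 3.29),
    "GTX1050 OpenCL": (163, 0.78, 2.45, 0.91, 3.36),
}

ROUNDING_TOLERANCE = 0.011  # assumed allowance for two-decimal rounding


def cycles_billions(instructions_billions, cpi):
    """CPI is cycles per instruction, so cycles = instructions * CPI."""
    return instructions_billions * cpi


def bandwidth_gap(read_gbs, write_gbs, total_gbs):
    """Difference between read + write and the reported total bandwidth."""
    return read_gbs + write_gbs - total_gbs


if __name__ == "__main__":
    for name, (instr, cpi, read, write, total) in FIGURE_11.items():
        gap = bandwidth_gap(read, write, total)
        status = "ok" if abs(gap) < ROUNDING_TOLERANCE else "check"
        print(f"{name:20s} cycles={cycles_billions(instr, cpi):8.1f}e9 "
              f"read+write-total={gap:+.2f} ({status})")
```

Under these assumptions every configuration passes the bandwidth check, which suggests the three bandwidth rows describe the same measurement interval.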
gtx1050 cuda                                              gtx1050 opencl
Process / Module / Function   Effective Time  GPU Time    Process / Module / Function   Effective Time  GPU Time
geekbench_x86_64.exe          19.54           12.5167     geekbench_x86_64.exe          12.962          7.28441
  ntoskrnl.exe                9.07398                       ntoskrnl.exe                5.93999
  geekbench_x86_64.exe        4.76799                       geekbench_x86_64.exe        3.07299
  nvlddmkm.sys                1.667                         nvopencl.dll                1.484
  nvcuda.dll                  1.664                         nvlddmkm.sys                1.008
  dxgkrnl.sys                 1.421                         dxgkrnl.sys                 0.797999
  hal.dll                     0.449999                      hal.dll                     0.283
  win32u.dll                  0.212                         win32u.dll                  0.107
  watchdog.sys                0.089                         ntdll.dll                   0.077
  dxgmms2.sys                 0.059                         watchdog.sys                0.056
  ntdll.dll                   0.037                         dxgmms2.sys                 0.045
  win32kbase.sys              0.043                         win32kbase.sys              0.022
  win32kfull.sys              0.011                         nvfatbinaryloader.dll       0.008
  fltmgr.sys                  0.006                         mfeaack.sys                 0.007
  win32k.sys                  0.004                         mfencbdc.sys                0.007
  kernel32.dll                0.004                         fltmgr.sys                  0.006
  gdi32full.dll               0.004                         mfehidk.sys                 0.005
  mfehidk.sys                 0.003                         mfehidk.sys                 0.005

Figure 12 – VTUNE Geekbench GTX1050 Top Consumers

gtx860m cuda                                              gtx860m opencl
Process / Module / Function   Effective Time  GPU Time    Process / Module / Function   Effective Time  GPU Time
geekbench_x86_64.exe          28.7634         19.0818     geekbench_x86_64.exe          18.9147         10.895
  ntoskrnl.exe                13.1684                       ntoskrnl.exe                8.56778
  geekbench_x86_64.exe        6.2364                        geekbench_x86_64.exe        4.27688
  nvlddmkm.sys                2.99291                       nvopencl.dll                2.11488
  nvcuda.dll                  2.49376                       nvlddmkm.sys                1.74904
  dxgkrnl.sys                 2.3825                        dxgkrnl.sys                 1.28797
  hal.dll                     0.647495                      hal.dll                     0.379877
  gdi32.dll                   0.313724                      gdi32.dll                   0.180417
  watchdog.sys                0.205474                      ntdll.dll                   0.075174
  dxgmms1.sys                 0.142329                      watchdog.sys                0.100231
  win32k.sys                  0.106245                      dxgmms1.sys                 0.087201
  ntdll.dll                   0.047109                      win32k.sys                  0.042097
  fltmgr.sys                  0.006014                      bhdrvx64.sys                0.012028
  ntfs.sys                    0.003007                      ntfs.sys                    0.005012
  kernelbase.dll              0.003007                      srtsp64.sys                 0.004009
  cng.sys                     0.002005                      kernel32.dll                0.004009
  srtsp64.sys                 0.002005                      nvfatbinaryloader.dll       0.004009
  bhdrvx64.sys                0.002005                      storport.sys                0.002005

Figure 13 – VTUNE Geekbench GTX860M Top CPU
intel4600 opencl

Process / Module / Function   CPU Time    GPU Time
geekbench_x86_64.exe          9.0689      30.1763
  geekbench_x86_64.exe        4.2258
  ntdll.dll                   0.1544
  ntoskrnl.exe                2.3875
  igdrcl64.dll                1.7380
  igdfcl64.dll                0.3047
  igdusc64.dll                0.0712
  igdkmd64.sys                0.0461
  dxgmms1.sys                 0.0281
  dxgkrnl.sys                 0.0261
  nvlddmkm.sys                0.0150
  ntfs.sys                    0.0090
  kernelbase.dll              0.0040
  fltmgr.sys                  0.0080
  hal.dll                     0.0070
  iastora.sys                 0.0060
  storport.sys                0.0060
  igdbcl64.dll                0.0050
  bhdrvx64.sys                0.0040

Computing Task (GPU)      Time    Average   Count   GB/sec  Active  Stalled  Idle    threads
particle                  1.88    0.0004    4556    0.00    0.567   0.363    0.071   280160
fft                       1.36    0.0069    197     0.00    0.186   0.809    0.005   51315456
[Unknown]                 1.14    0.0000            0.00    0.011   0.011    0.978   25890964
clEnqueueReadBuffer       0.80    0.0080    100     3.53    0.018   0.953    0.029   10705784
lens_blur_gpu             0.72    0.0312    23      0.00    0.631   0.358    0.012   3368023
particle                  0.69    0.0009    790     0.00    0.325   0.640    0.035   49568
detect                    0.61    0.0307    20      0.00    0.410   0.539    0.051   224206
clEnqueueWriteBuffer      0.44    0.0011    387     7.08    0.036   0.904    0.059   11415798
clEnqueueWriteImage       0.39    0.0044    90      0.00    0.078   0.867    0.055   8582722
raw_quantize              0.37    0.0184    20      0.00    0.937   0.062    0.001   20276756
convolve_1d_vertical      0.36    0.0178    20      0.00    0.828   0.171    0.001   13108628
convolve_1d_horizontal    0.35    0.0174    20      0.00    0.836   0.145    0.019   12403999
clEnqueueReadImage        0.33    0.0057    59      0.00    0.051   0.883    0.065   6485273
raw_process               0.29    0.0145    20      0.00    0.917   0.060    0.023   5040070
particle                  0.25    0.0004    590     0.00    0.562   0.372    0.066   36544
particle                  0.25    0.0004    598     0.00    0.560   0.370    0.070   36864
particle                  0.25    0.0004    597     0.00    0.564   0.369    0.067   36928
particle                  0.25    0.0004    589     0.00    0.560   0.371    0.069   36416
fft                       0.23    0.0069    34      0.00    0.187   0.804    0.010   8825847

Figure 14 – VTUNE Geekbench Intel4600 Top CPU/GPU

Figures 15 – 17 graphically illustrate GPU Software Queue, GPU Usage, CPU time (with overlapping GPU time) and System Bandwidth for three of the measured environments as produced by VTUNE CPU/GPU Concurrency Analysis. If it seems like deja-vu all over again (and again…), it is because the application is simple and predictable, with most of the measured differences related to hardware speeds.

Figure 15 – GTX860M VTUNE CPU/GPU Concurrency OpenCL
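One way to read the GPU computing-task columns of Figure 14 is that Time is roughly Average multiplied by Count. The Python sketch below checks that reading on a few rows copied from the figure. It is an interpretation, not something the source states: it assumes both time columns use the same unit and that Time is printed to two decimals and Average to four, so each check allows for that rounding. `time_bounds` and `is_consistent` are hypothetical helper names.

```python
# (task, total GPU time, average time per instance, instance count),
# copied from the GPU computing-task columns of Figure 14.
GPU_TASKS = [
    ("particle", 1.88, 0.0004, 4556),
    ("fft", 1.36, 0.0069, 197),
    ("clEnqueueReadBuffer", 0.80, 0.0080, 100),
    ("lens_blur_gpu", 0.72, 0.0312, 23),
    ("detect", 0.61, 0.0307, 20),
    ("raw_quantize", 0.37, 0.0184, 20),
]

AVERAGE_HALF_STEP = 0.00005  # assumed: Average is printed to four decimals
TIME_HALF_STEP = 0.005       # assumed: Time is printed to two decimals


def time_bounds(average, count):
    """Range of totals compatible with a rounded average and a count."""
    low = (average - AVERAGE_HALF_STEP) * count - TIME_HALF_STEP
    high = (average + AVERAGE_HALF_STEP) * count + TIME_HALF_STEP
    return low, high


def is_consistent(total, average, count):
    """True when total could equal average * count before rounding."""
    low, high = time_bounds(average, count)
    return low <= total <= high


if __name__ == "__main__":
    for task, total, average, count in GPU_TASKS:
        estimate = average * count
        verdict = "consistent" if is_consistent(total, average, count) else "differs"
        print(f"{task:22s} reported={total:.2f} average*count={estimate:.3f} {verdict}")
```

For short, frequently launched kernels such as `particle`, the rounded average carries little precision, so the product is only a coarse estimate of the reported total.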

