ANS MC2015 - Joint International Conference on Mathematics and Computation (M&C), Supercomputing in Nuclear Applications (SNA) and the Monte Carlo (MC) Method • Nashville, TN • April 19-23, 2015, on CD-ROM, American Nuclear Society, LaGrange Park, IL (2015)

CONCURRENT CPU, GPU AND MIC EXECUTION ALGORITHMS FOR ARCHER MONTE CARLO CODE INVOLVING PHOTON AND NEUTRON RADIATION TRANSPORT PROBLEMS

Noah Wolfe and Christopher Carothers
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180
[email protected]; [email protected]

Tianyu Liu and X. George Xu
Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, NY 12180
[email protected]; [email protected]

ABSTRACT

ARCHER-CT and ARCHER-Neutron are Monte Carlo photon and neutron transport applications that have now been updated to utilize CPU, GPU and MIC computing devices concurrently. ARCHER detects and simultaneously utilizes all CPU, GPU and MIC processing devices that are available. A different device layout and load-balancing algorithm is implemented for each Monte Carlo transport application. ARCHER-CT utilizes a new "self-service" approach that efficiently and effectively allows each device to independently grab portions of the domain and compute concurrently until the entire CT phantom domain has been simulated. ARCHER-Neutron uses a dynamic load-balancing algorithm that distributes the particles in each batch to each device based on its particles-per-second rate in the previous batch. This algorithm allows multiple architectures and devices to execute concurrently. A near-linear scaling speedup is observed when using only GPU devices concurrently. New timing benchmarks using various combinations of Intel and NVIDIA devices are presented for each application. A speedup of 16x for ARCHER-Neutron and 44x for ARCHER-CT has been observed when utilizing an entire 4U, 9-device heterogeneous computing system composed of an Intel CPU, an Intel MIC and 7 NVIDIA GPUs.

Key Words: ARCHER, Heterogeneous, CPU, GPU, MIC

1 INTRODUCTION

As the number of heterogeneous computing systems increases, so does the need for applications that can fully utilize all the computing power available. The rapid acceptance of heterogeneous computing systems provides easy access to many computing architectures. One common limitation of many current applications is that they are designed to use only one architecture. While one architecture can be better suited for an application than others, the best performance can usually be achieved by tapping into all available computing resources. In previous studies [1], [6], [10], we developed and optimized ARCHER (Accelerated Radiation-transport Computations in Heterogeneous EnviRonments), an efficient Monte Carlo transport code, on each architecture individually. That work reported a 33x speedup when comparing single-GPU execution against single-thread CPU execution. This paper describes the development of a generic execution model that simultaneously utilizes all the computing architectures available.

The two applications discussed in this paper, which can benefit from full utilization of these powerful new heterogeneous systems, are in the field of nuclear radiation transport. The first is a medical CT application, ARCHER-CT, that uses the Monte Carlo photon transport method. The capability to fully exploit a hardware system is much desired as it can potentially improve patient throughput.
Some early research was conducted to simulate the CT scan procedure and calculate the radiation dose on the CPU, GPU and MIC platforms [12]. Another application that has been implemented on the GPU and can benefit from concurrent execution is a Monte Carlo neutron transport eigenvalue calculation [9]. In each application, the codes were separately developed and tested on each CPU, GPU and MIC platform. The significance of the heterogeneous computing algorithms introduced in this paper is to maximize system usage and increase throughput by dynamically detecting and concurrently utilizing all available computing devices. This paper reports the status of the ARCHER code in terms of algorithms for concurrent execution on CPU, GPU and MIC for both photon and neutron transport simulations.

2 BACKGROUND

2.1 Heterogeneous Computing Systems

While there are still some exceptions to the rule, the general trend in scientific computing hardware is moving away from one-architecture systems. We are in an age of heterogeneous computing systems in which CPUs, GPUs and MICs are being integrated to help perform compute tasks. More and more of the supercomputer systems listed in the top 500 are becoming heterogeneous, containing more than one computing architecture. For example, the number one computer in the world as of November 2014, the Tianhe-2 (MilkyWay-2), consists of 32,000 12-core Intel Xeon Central Processing Units (CPUs) and 48,000 57-core Intel Xeon Phi Many Integrated Core (MIC) coprocessors. Number two on the list is the Titan supercomputer, composed of 18,688 16-core AMD Opteron CPUs and 18,688 NVIDIA K20X Graphics Processing Units (GPUs). Three of the supercomputer systems listed use a combination of GPU and MIC accelerators. This very same trend is not just seen at the leadership supercomputing level, but also at the small cluster and even consumer desktop computing level. Fueled by the video gaming industry, consumer-grade desktop computers are frequently powerful heterogeneous systems. These systems include Intel and AMD multi-core/multi-threaded CPUs paired with AMD and NVIDIA GPUs with over one thousand compute units.

2.1.1 Central processing unit (CPU)

The CPU, a common computing architecture, is typically one physical chip consisting of several CPU cores. These CPU cores are completely independent, with their own execution pipelines, registers and control units that allow them to execute application instructions in parallel. Many of today's CPUs also have the ability to run multiple threads per core. Although having multiple threads per core does not necessarily translate to linear speed-ups, it does allow each core to fully utilize its execution pipeline.

2.1.2 Graphics processing unit (GPU)

A rapidly advancing architecture, the GPU is a very powerful computing option. Used as an accelerator, the GPU is attached to the host system (usually a CPU) via PCIe. On the hardware side of the GPU architecture, each GPU card contains a certain number of Streaming Multiprocessors (SMs). These SMs are composed of a given number of lightweight Compute Unified Device Architecture (CUDA) cores that can each perform one instruction per clock cycle. For example, NVIDIA's Kepler microarchitecture contains 192 CUDA cores per SM (12 groups of 16) [11]. NVIDIA's K40 compute card contains 15 SMs, which brings the total number of compute units to 2880.

On the software side, components are organized in much the same way. At the highest level there are grids of blocks, and each block is composed of a given number of threads. The number of blocks and threads per block are set and controlled by the user. When it comes to execution, each block of threads, when ready, is assigned in warps (groups of 32 threads) to a group of 16 cores within an SM for computing. All threads in the same warp follow Single Instruction Multiple Thread (SIMT) execution [7]: they perform the same operations simultaneously on different data. As long as conditionals and divergence among threads in a block executing on an SM are kept to a minimum, an application can achieve impressive speedups. In the case of Monte Carlo transport, a threshold number of threads is required to saturate the GPU and minimize the impact of divergence.
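To make the grid/block/thread hierarchy concrete, a minimal CUDA launch sketch is shown below. This is illustrative only, not ARCHER source code: the kernel name, tally array and sizing constants are placeholders. The block size is chosen as a multiple of the 32-thread warp size so that warps map cleanly onto the SM's execution units.

```cpp
// Minimal CUDA launch sketch (illustrative only; not ARCHER source code).
#include <cuda_runtime.h>

__global__ void transportKernel(float *tally, int nHistories) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < nHistories) {
        // ... track one particle history per thread ...
    }
}

int main() {
    const int nHistories = 1 << 20;                    // histories for this launch
    const int threadsPerBlock = 256;                   // multiple of the 32-thread warp size
    const int blocks = (nHistories + threadsPerBlock - 1) / threadsPerBlock;

    float *tally = nullptr;
    cudaMalloc(&tally, sizeof(float));

    transportKernel<<<blocks, threadsPerBlock>>>(tally, nHistories);
    cudaDeviceSynchronize();

    cudaFree(tally);
    return 0;
}
```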
2.1.3 Many integrated core (MIC)

The MIC architecture is laid out exactly as its name suggests: many lower-powered processing cores are combined on one chip and connected via a bidirectional ring interconnect. With a frequency of around 1 GHz, each core supports 4 hardware threads. This large number of threads helps keep the computational pipeline of each core efficiently utilized. To go along with the increased number of threads, the MIC architecture also increases the width of its vector units. Wider 512-bit vector units, matched with a new SIMD instruction set called Initial Many Core Instructions (IMCI), mean each core can perform up to 8 double-precision floating-point operations per clock cycle [4]. Intel's first-generation Knights Corner Phi coprocessors provide 57 to 61 cores in the form of a PCIe accelerator card. Similar to the GPU, the Intel Phi MIC plugs directly into the motherboard and interacts with the system through the PCIe bus. All of this makes the Intel MIC architecture a very powerful parallel computing platform.

2.2 Neutron Criticality Application

ARCHER-Neutron implements a simplified neutron transport model used in our previous studies [8], [9]. The key assumptions and methods are as follows. (1) The problem considered is a one-group criticality calculation, where capture, fission and isotropic scattering are modeled. (2) The geometry is a 1-D slab selected from a V&V test in MCNP6.1 with ID number UD2O-H2O(10)-1-0-SL. The fuel is composed of U-235 and D2O and the moderator is H2O. The cross-section information is given in reference [2]. (3) Shannon entropy is calculated to evaluate the source convergence. (4) Implicit absorption and a weight window are used. (5) The eigenvalue and Shannon entropy results are verified against MCNP6.1 according to the method in [3]. A total of 130 batches are run and the first 30 batches are inactive. Our main focus is to assess the efficacy of the established concurrent execution model for simple criticality calculations. This is an important step before more intricate interaction physics, tallies and geometries are implemented.

2.3 Photon Computed Tomography Application

ARCHER-CT is a simulation tool created for quantifying and reporting patient-specific radiation dose for X-ray Computed Tomography (CT) scan procedures. The photo-atomic interaction models include the photoelectric effect, Compton scattering and Rayleigh scattering. Electron transport is not explicitly simulated and the corresponding energy is locally deposited, because the photon energy for the CT application is usually no more than 140 keV and the secondary electrons have a shorter range than the dimension of a voxel.
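As an illustration of how one interaction step might be sampled from the three modeled photo-atomic processes, a minimal sketch is given below. The cross-section values, function name and random-number generator choice are placeholders and are not ARCHER's actual data or API; in the real code the cross sections are energy-dependent quantities read from data tables.

```cpp
// Illustrative sketch of sampling a photo-atomic interaction type
// from the photoelectric, Compton and Rayleigh cross sections.
#include <random>
#include <cstdio>

enum class Interaction { Photoelectric, Compton, Rayleigh };

Interaction sampleInteraction(double sigmaPE, double sigmaCompton,
                              double sigmaRayleigh, std::mt19937_64 &rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    const double sigmaTotal = sigmaPE + sigmaCompton + sigmaRayleigh;
    const double xi = u01(rng) * sigmaTotal;          // point on the total cross section
    if (xi < sigmaPE)                 return Interaction::Photoelectric;
    if (xi < sigmaPE + sigmaCompton)  return Interaction::Compton;
    return Interaction::Rayleigh;
}

int main() {
    std::mt19937_64 rng(12345);
    // Placeholder cross sections (arbitrary units) for one photon energy.
    Interaction type = sampleInteraction(0.02, 0.15, 0.01, rng);
    printf("sampled interaction: %d\n", static_cast<int>(type));
    return 0;
}
```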
The code includes a GE Lightspeed 16 Pro CT scanner model previously developed and tested in MCNP [5], and has been verified against MCNP [10]. In ARCHER-CT, the continuous circular motion of a CT scanner is approximated by 16 discrete equiangular positions per scanner rotation. In reference [5], one MCNP run simulated one scanner rotation, whereby the initial source location was sampled from the 16 positions. The whole-body scan simulation involved many axial scanner rotations and therefore many MCNP runs. To eliminate redundant file I/O, ARCHER-CT groups the simulations together. Each axial scan is simulated independently with a given number of particle histories. The parallel strategy takes advantage of the independence of the individual histories computed for one scanner rotation and requires less inter-processor communication than other methods. This approach is considered massively parallel, as no particle history depends on any information from any other particle history. The highly parallel nature of this simulation makes it a great candidate for heterogeneous concurrent execution.

3 PROGRAM IMPLEMENTATION

Essential to the concurrent execution effort is the manner in which each of the devices in the heterogeneous system is connected, configured and operated. As shown in Figure 2, the computing architectures are connected to one another via the PCIe bus. The CPU is the main computing unit located in the CPU socket, and the GPU and MIC devices are accelerator cards connected through the PCIe slots.

3.1 Device Organization

Both applications share the same basic compute algorithm in that they perform particle transport. All calculations are performed using double precision. The key difference is that the photon CT application is composed of many independent batches (i.e., scanner rotations) that can be computed in parallel, whereas the neutron criticality application is a sequence of dependent batch simulations. As a result of these differences, we take two slightly different approaches to device organization and load balancing.

3.1.1 Self-Service Compute Structure

First, working with the GPU devices requires the CPU to act as an operator. The CPU is needed to initialize memory, transfer data and submit kernel jobs to each of the GPU devices. For the CT photon application, in which each particle transport calculation can be performed in parallel, we take the approach shown in Figure 1: we create one CPU process R0 via the Message Passing Interface (MPI) and dedicate one hardware thread via the Open Multi-Processing framework (OpenMP) to control the GPU. The remaining CPU hardware threads are grouped together with OpenMP in a separate MPI rank R1 to compute on their own. The CPU and GPU can then compute concurrently and efficiently. Each additional GPU device gets its own MPI process/rank with one dedicated hardware thread. We use OpenMP here as opposed to POSIX threads (Pthreads) because each thread within the MPI process performs the same set of operations; as a higher-level API, OpenMP makes it relatively simple to implement straightforward work sharing and parallelization.

The MIC, on the other hand, is set up to run according to the "native execution model", where the MIC devices are seen as extra compute nodes running their own micro Linux operating system. We have found that the most efficient layout is to create one MPI process for every 4 MIC hardware threads. Again, OpenMP is used to pin hardware threads to MPI processes.
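A minimal sketch of the MPI/OpenMP layout described above, for one CPU and one GPU, is given below: rank R0 dedicates a single OpenMP thread to drive the GPU, while rank R1 uses the remaining hardware threads for CPU-side transport. This is illustrative only, not ARCHER source code; rank counts, thread counts and the GPU service work are placeholders.

```cpp
// Illustrative sketch (not ARCHER source code) of the rank/thread layout.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        omp_set_num_threads(1);                 // one thread services the GPU
        #pragma omp parallel
        {
            // initialize device memory, transfer data, submit kernels ...
            printf("rank 0: GPU service thread %d\n", omp_get_thread_num());
        }
    } else {
        #pragma omp parallel                    // remaining threads compute on the CPU
        {
            printf("rank %d: CPU compute thread %d\n", rank, omp_get_thread_num());
        }
    }

    MPI_Finalize();
    return 0;
}
```

A launch along the lines of `mpirun -n 2 ./archer_ct_host` (a hypothetical executable name), with rank 1's threads bound to the remaining cores, would realize this layout; the exact binding flags depend on the MPI implementation.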
As shown in Figure 1 for the CT photon application, each MPI process that corresponds to a CPU, GPU or MIC compute device can now independently operate and compute in a self-serving manner. In the case of Monte Carlo particle transport, the computational domain is the set of scan rotation index values, with n-many particles to simulate per index. These scan rotation index values are shown as the red rings in Figure 1. The transport execution is set up so that each device grabs as many scan rotation index values as it needs to stay saturated. It then computes the particle paths for all particle histories within those index values, and repeats the process until there are no more index values left.

Figure 1. CT application patient phantom domain distribution and device layout.

3.1.2 Dynamic Load Balancing Compute Structure

Unlike the CT application, which is inherently parallel, ARCHER-Neutron must follow a sequential batch process to account for the generation of new particles. The number of particles to simulate in the next batch is not known until the current batch is done. This prevents the use of the "self-serving" layout implemented for ARCHER-CT. Instead of fully independent devices, each device is set up to be semi-independent, computing a portion of each batch as shown in Figure 2. At the beginning of each batch, B0 to B130, the total number of particles to be simulated is efficiently distributed among the devices. After simulating its portion of the particles, each device must wait for all others to finish their portion of the batch to obtain the next batch particle count. The efficient distribution of the workload is done with a particles-per-second work rate calculation for each device. Knowing a device's particles-per-second work rate and the total work rate of all devices, which is obtained by an MPI_Allreduce(), each device can simulate n = (deviceWorkRate/totalWorkRate) × nextBatchSize particles in the next batch (a code sketch appears at the end of this section).

Another key difference between ARCHER-CT and ARCHER-Neutron is that the ARCHER-Neutron application uses Pthreads for on-node parallelization and device layout. The Pthreads API is chosen because it is a lower-level interface that provides more fine-grained control of thread management. Before stepping into the compute section of the application, there is an initialization function to set up the GPU-control Pthreads and bind them to a specific processor in the corresponding CPU. In contrast to the CT photon application, where each GPU device is serviced by one MPI process with one OpenMP hardware thread, here we control all GPU devices with one MPI process and N Pthreads bound to one physical CPU hardware thread, where N is the number of GPU devices. In essence, we are oversubscribing our hardware thread with many software Pthreads. This is an effective approach because the amount of service work the CPU needs to perform for each GPU device is minimal compared to the transport compute work. Therefore, we are able to save a significant amount of the CPU's physical compute power by assigning all remaining hardware threads to a separate MPI process via Pthreads. This separate MPI CPU process assists with computing a portion of the particles in each batch.

Figure 2. Neutron criticality application particle distribution and device layout.

The MIC setup differs slightly by allocating 1 MPI process with 240 Pthreads. Having one dense MPI process, instead of the many lightweight MPI processes used for ARCHER-CT, may cut down on the performance of the MIC but also reduces the delay in the synchronization and MPI communication step between batches. This is shown in Figure 2.
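The per-batch work-rate distribution described above can be sketched as follows. This is a minimal sketch, not ARCHER source code; the work rate and batch size shown are placeholders, while the MPI_Allreduce and the proportional split follow the formula given in the text.

```cpp
// Illustrative sketch (not ARCHER source code) of the per-batch work-rate
// load balancing: each rank reports its particles-per-second rate from the
// previous batch, and the next batch is split in proportion to each rank's
// share of the total rate.
#include <mpi.h>
#include <cstdio>

long particlesForNextBatch(double myRate, long nextBatchSize, MPI_Comm comm) {
    double totalRate = 0.0;
    MPI_Allreduce(&myRate, &totalRate, 1, MPI_DOUBLE, MPI_SUM, comm);
    return static_cast<long>((myRate / totalRate) * nextBatchSize);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double myRate = 1.0e6;          // placeholder: measured particles per second
    long nextBatchSize = 10000000;  // placeholder: 1e7 particles in the next batch
    long myShare = particlesForNextBatch(myRate, nextBatchSize, MPI_COMM_WORLD);
    printf("this rank simulates %ld particles in the next batch\n", myShare);
    MPI_Finalize();
    return 0;
}
```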
4 PROGRAM COMPOSITION

In order to have one simulation work on multiple devices concurrently, multiple code sets are compiled separately and executed together. Here the functionality, compilation and execution are described. Both applications, ARCHER-CT and ARCHER-Neutron, consist of three separate code sets, each optimized for execution on a specific architecture: a CPU, a GPU and a MIC code set for each application. These code sets are compiled separately and then executed together via an execution script. It is worth noting that any compiler flags that may trade accuracy for performance were avoided.

4.1 Heterogeneous System Device Query

In order for an application to fully utilize all underlying compute architectures in a heterogeneous system, it must first know what is available. To accomplish this, a separate program has been created using the Open Computing Language (OpenCL) and CUDA to detect the underlying architectures and their corresponding computational details. Based on the query result, the user can either select a subset of the available system architectures to use or simply select them all. After the selection process, the program outputs all relevant computing information to a file. This information includes everything from processor frequency to the number of cores and threads per device. This data is later used in an execution script to generate a heterogeneous concurrent execution runtime of the given application according to its optimal device layout as described in section 3.1.

4.2 Compilation and Execution on CPU and MIC

The CPU and MIC code sets for each application are tailored for efficient execution on CPU and MIC devices, either by themselves or concurrently with other devices. As previously mentioned, MIC devices have been set up to run following the "native execution model". This allows each MIC to run the exact same code as the CPU. The only difference is in the compilation step, where we add the flag -mmic to specify that we are compiling and linking an executable for the MIC architecture. With a CPU-compiled executable residing on the host CPU, and a MIC-compiled executable of the exact same code residing on each MIC device, the Intel MPI management tool mpirun is used to launch the executables to run individually or compute concurrently with other devices.

4.3 Compilation and Execution on GPU

This version of the ARCHER transport application is a subset of code that is tailored and optimized for efficient concurrent execution on both the CPU as the host and one or more GPUs as devices. The CPU rank computes the transport simulation using either Pthreads or OpenMP to utilize all remaining threads not subscribed to servicing GPUs, as discussed in section 3.1. It is worth mentioning that in the ARCHER-CT case, GPU streams are used to compute multiple scanRotationIndex values on a single GPU in an attempt to minimize the divergent, irregular nature of Monte Carlo transport. Each stream is another instance of the transport compute kernel running with a different scan index, which allows for better saturation of each GPU.
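A minimal sketch of this streams approach is shown below. It is illustrative only, not ARCHER source code: the kernel name, stream count, launch dimensions and dose array are placeholders. The key point is that each stream carries its own kernel instance with its own scan rotation index, so several instances can overlap on the device.

```cpp
// Illustrative CUDA streams sketch (not ARCHER source code).
#include <cuda_runtime.h>

__global__ void scanRotationKernel(int scanRotationIndex, float *dose) {
    // ... transport all particle histories for this scan rotation index ...
}

int main() {
    const int nStreams = 4;                     // streams per GPU (illustrative)
    cudaStream_t streams[nStreams];
    float *dose = nullptr;
    cudaMalloc(&dose, 1024 * sizeof(float));

    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // One kernel instance per stream, each with its own scan rotation index.
    for (int s = 0; s < nStreams; ++s)
        scanRotationKernel<<<128, 256, 0, streams[s]>>>(s, dose);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(dose);
    return 0;
}
```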
With executables compiled using CUDA's nvcc and Intel's mpiicpc compilers, mpirun is used to run on the GPU individually or concurrently with other devices.

4.4 Concurrent Calculation on CPU, GPU and MIC

With executables compiled for each architecture, concurrent execution across all three architectures is achieved simply with the mpirun utility. The mpirun utility generates the corresponding number of MPI ranks running the corresponding executable on each CPU, GPU and MIC in the simulation. It is very important that each code set uses MPI blocking functions such as MPI_Barrier() in the same locations to prevent deadlock.

5 HETEROGENEOUS COMPUTING SYSTEM

In the interest of a broad and fair comparison, the new heterogeneous ARCHER implementations are tested on a number of devices, as shown in Tables I and II. Each of these devices is located in a local 4U-server-sized heterogeneous computer. In total there are nine devices: one of each device listed in Tables I and II, plus three additional NVIDIA M2090 GPUs (four M2090s in total).

5.1 CPU and MIC Devices

The ARCHER applications are tested on an older midrange Intel Xeon X5650 server CPU. The MIC architecture is much newer than the CPU, so the selection of devices is more limited. As a result, only one device, the Intel Xeon Phi 5110P coprocessor, is tested.

Table I. Intel CPU and MIC Device Comparison

  Product            Intel Xeon   Intel Xeon
  Model              X5650        5110P
  Microarchitecture  Westmere     Knights Corner
  Cores              6            60
  Threads Per Core   2            4
  Compute Units      12           240
  Clock Frequency    2.66 GHz     1.05 GHz

5.2 GPU Devices

Due to their high availability, a number of NVIDIA GPUs have been benchmarked. The first is a very popular and common desktop GPU used for computer video games, the NVIDIA GeForce GTX Titan. The second is an older compute-dedicated NVIDIA Tesla M2090. The last two GPU devices are the compute-dedicated NVIDIA K20c and K40c.

Table II. NVIDIA GPU Device Comparison

  Product               NVIDIA      NVIDIA    NVIDIA    NVIDIA
  Model                 GTX-Titan   M2090     K20c      K40c
  Microarchitecture     Kepler      Tesla     Kepler    Kepler
  Multiprocessors (MP)  14          16        13        15
  CUDA Cores            2688        512       2496      2880
  Max Threads           28672       24576     26624     30720
  Clock Frequency       876 MHz*    1.3 GHz   706 MHz   876 MHz**

  *Boost frequency listed. Base frequency is 837 MHz.
  **Boost frequency listed. Base frequency is 745 MHz.

6 RESULTS

6.1 ARCHER-CT Photon

As a performance comparison, ARCHER-CT has been executed with 90 scanRotationIndex values and 1×10^7 particles. Only the transport computation is timed; file I/O and device initialization time are excluded. All timing results are computed on each rank using the RDTSC instruction. This machine-level instruction returns the number of cycles the processor has completed since the last reset. The time elapsed is expressed as (cyclesEnd - cyclesStart)/processorFrequency.
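A minimal sketch of this cycle-counter timing approach is shown below, using the compiler's __rdtsc() intrinsic on GCC/Clang. The frequency value is a placeholder (the X5650 base clock) and ARCHER's actual timing code may differ in detail.

```cpp
// Illustrative RDTSC timing sketch (not ARCHER source code).
// Elapsed time = (cyclesEnd - cyclesStart) / processorFrequency.
#include <x86intrin.h>   // __rdtsc()
#include <cstdint>
#include <cstdio>

int main() {
    const double processorFrequency = 2.66e9;   // placeholder: X5650 base clock in Hz

    uint64_t cyclesStart = __rdtsc();
    // ... transport computation being timed ...
    uint64_t cyclesEnd = __rdtsc();

    double elapsedSeconds =
        static_cast<double>(cyclesEnd - cyclesStart) / processorFrequency;
    printf("transport time: %.6f s\n", elapsedSeconds);
    return 0;
}
```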
6.1.1 Single device execution

As can be seen in Table III below, the MIC device performed better than the Intel server CPU, as expected. All four GPU devices outperformed the MIC, with the GTX Titan and K40c reaching speedups of 10.1x and 9.8x, respectively, over the X5650 CPU. Simply put, the MIC architecture is currently being underutilized. The Xeon Phi MIC architecture includes 32 512-bit vector registers that can each pack in and simultaneously execute sixteen single-precision floating-point numbers [4]. While the Knights Corner MIC provides 512-bit SIMD instructions (IMCI) that can utilize these registers, the Intel C++ compiler is not able to auto-vectorize many of the complex loop structures in the Monte Carlo transport computation. The result is 512-bit registers capable of performing sixteen single-precision floating-point calculations per clock cycle that, in actuality, carry only one useful single-precision result. Taking that into account, the Intel MIC device is posting a good result. Full utilization of the 512-bit vector units by ARCHER is still a work in progress.

Table III. Single Device Execution Times

  Device      Time (s)   Speedup
  X5650 CPU   875.7      Baseline
  5110P MIC   211.8      4.13x
  M2090 GPU   125.7      6.97x
  Titan GPU   86.4       10.14x
  K20c GPU    109.2      8.02x
  K40c GPU    89.5       9.78x

6.1.2 Concurrent execution

As can be seen in Table IV below, speedup can be achieved by combining any of the architectures to execute concurrently. The parallel nature of the ARCHER-CT application requires very little communication between MPI ranks and no synchronization between devices. The only communication, a non-blocking send of the updated index value to the other ranks, introduces no delay. Focusing on speedups, pairing an Intel MIC or an NVIDIA K40 GPU with the Intel Xeon CPU returns 5x and 10x speedups, respectively. The results show a 13x speedup when adding both the Intel Xeon Phi MIC and the NVIDIA K40 GPU to the Intel Xeon CPU. Finally, utilizing an entire 4U heterogeneous compute system with one CPU, one MIC and 7 GPUs results in a 44x speedup over the CPU alone.

Table IV. Concurrent Execution Times and Speedups

  Devices                               Execution Time (s)   Speedup
  CPU                                   875.7                Baseline
  CPU, MIC                              173                  5.01x
  CPU, K40 GPU                          85.5                 10.24x
  CPU, MIC, K40                         69.3                 12.64x
  CPU, MIC, 4x M2090, K40, Titan, K20   19.9                 44.00x

6.2 ARCHER-Neutron

Moving on to ARCHER-Neutron, we timed the entire transport computation and excluded file I/O and device initialization time. For this application we again run with 1×10^7 initial particles and continue for a total of 130 batches. All timing results are computed on each rank using the MPI_Wtime() timer.

6.2.1 Single device execution

The execution times for ARCHER-Neutron running on each device in the 4U server system are presented in Table V. The execution speedups are lower than those for the ARCHER-CT application. This is expected, as the neutron implementation requires a sequential compute process. The GPUs give good speedups with respect to the X5650 CPU, but the MIC fails to achieve the single-device speedup observed for ARCHER-CT. This is again believed to be caused by underutilization of the vector processing units within the MIC. Also, a better ratio of Pthreads to MPI ranks on the MIC is expected to improve the computation speedup, at the cost of additional time spent in MPI communication and synchronization after each batch.

Table V. Single Device Execution Times

  Device      Time (s)   Speedup
  X5650 CPU   617.1      Baseline
  5110P MIC   600.7      1.03x
  M2090 GPU   240.5      2.57x
  Titan GPU   230.1      2.68x
  K20c GPU    128.6      4.80x
  K40c GPU    90.3       6.84x

6.2.2 GPU scaling

In order to test the efficiency of the dynamic load-balancing algorithm based on particle compute rate, we test the ARCHER-Neutron application scaling on multiple GPUs.
The same application is also run and timed with the original uniform load-balancing method for comparison. As presented in Table VI, both load-balancing methods achieve the same compute times when using only identical M2090 devices, because each device then has the same work rate and the dynamic method reduces to the uniform method. When adding dissimilar GPU devices, the dynamic load-balancing method clearly outperforms the uniform method. Using all 7 GPU devices in the 4U compute system, the dynamic method reaches a speedup of 8.38x versus 6.35x for the uniform method, a speedup difference of 32%. Speedup difference is computed as (s_dyn − s_uni)/s_uni × 100.

Table VI. Multi-GPU Execution Times and Speedups

                               Uniform Load Balancing   Dynamic Load Balancing
  Device                       Time (s)   Speedup       Time (s)   Speedup       Speedup Difference
  1x M2090                     240.5      Baseline      240.5      Baseline      0%
  2x M2090                     121.2      1.98x         121.3      1.98x         0%
  3x M2090                     81.7       2.94x         81.8       2.94x         0%
  4x M2090                     62.1       3.87x         62.2       3.92x         1%
  4x M2090, K40                50.4       4.77x         38.6       6.24x         31%
  4x M2090, K40, Titan         42.6       5.65x         33.9       7.09x         25%
  4x M2090, K40, Titan, K20    37.9       6.35x         28.7       8.38x         32%
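As a worked instance of the speedup-difference formula, using the 7-GPU row of Table VI:

\[
\frac{s_{dyn} - s_{uni}}{s_{uni}} \times 100 = \frac{8.38 - 6.35}{6.35} \times 100 \approx 32\%
\]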