Lecture Notes in Computer Science
Edited by G. Goos and J. Hartmanis

374

Kay A. Robbins
Steven Robbins

The Cray X-MP/Model 24
A Case Study in Pipelined Architecture and Vector Processing

Springer-Verlag
Berlin Heidelberg New York London Paris Tokyo Hong Kong

Editorial Board
D. Barstow, W. Brauer, P. Brinch Hansen, D. Gries, D. Luckham, C. Moler, A. Pnueli, G. Seegmüller, J. Stoer, N. Wirth

Authors
Kay A. Robbins
Steven Robbins
Division of Mathematics, Computer Science, and Statistics
The University of Texas at San Antonio
San Antonio, TX 78285, USA

CR Subject Classification (1987): C.1.2, D.4.1

ISBN 0-387-97089-4 Springer-Verlag New York Berlin Heidelberg
ISBN 3-540-97089-4 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in its version of June 24, 1985, and a copyright fee must always be paid. Violations fall under the prosecution act of the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1989
Printed in Germany
Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr.
2145/3140-543210 - Printed on acid-free paper

Contents

Preface

1 Overview of the Cray X-MP/Model 24
  1.1 Introduction
  1.2 The Cray X-MP/Model 24 Architecture

2 The Control Section
  2.1 The Instruction Cycle
  2.2 The Instruction Issue Phase
  2.3 The Instruction Buffers and the Instruction Fetch
  2.4 Instruction Execution

3 The Scalar Section
  3.1 The Scalar Section Hardware
  3.2 Execution in the Scalar Section

4 The Address Section
  4.1 The Address Section Hardware
  4.2 Addresses on the Cray X-MP

5 Vectors and Vector Operations
  5.1 Basic Operation of the Vector Section
  5.2 Instruction Issue and Source Register Reservation
  5.3 Result Register Reservations and Chaining
  5.4 Vector Memory Operations Without Conflicts
  5.5 Vectorization
  5.6 The Effect of Dependencies on Vectorization

6 Memory Access
  6.1 Memory Organization
  6.2 Scalar Memory Transfers
  6.3 Vector Transfers
  6.4 Memory Conflicts for Vector Operations
  6.5 Bidirectional Memory Access
  6.6 Instruction Fetches

7 Interprocessor Communication and Multitasking
  7.1 Introduction
  7.2 Shared Memory on the Dual Processor System
  7.3 Hardware Support for Interprocessor Communication
  7.4 The Test-And-Set Operation for Mutual Exclusion
  7.5 Multiprogramming
  7.6 The Busy Waiting Problem
  7.7 Fortran Locks and Events
  7.8 Fortran Tasks
  7.9 Static Scheduling and Self-scheduling of Loops
  7.10 The Future

A PMS Diagram of the Cray X-MP I/O Subsystem

B Exchange Package for the Cray X-MP

C Lawrence Livermore Loops

D Sample Programs
  D.1 Introduction
  D.2 Examples from Chapter 3
  D.3 Examples from Chapter 5
  D.4 Examples from Chapters 6 and 7

E Instruction Execution Summary for the Cray X-MP

F XMPSIM Users Manual
  F.1 Introduction
  F.2 Basic Simulator Operation
  F.3 The Cray X-MP Instruction Cycle
  F.4 Creating a Source Program
  F.5 Invoking the Simulator
  F.6 The Main Display
  F.7 The Configuration Menu
  F.8 Description of the Assembly Phase
  F.9 Tokenization of Cray Assembly Language Instructions
  F.10 Internal Data Structures
  F.11 Data Types for Numeric Values
  F.12 Simulator Restrictions

Bibliography

Preface

This monograph examines the issues relevant to the design of vector and pipelined computer systems. The Cray X-MP/24 is used as a case study to examine how design tradeoffs affect performance. Enough technical details are provided so that a reader may work out timings for the Cray X-MP without reference to a hardware manual. We hope that a serious look at the details of the design will give the reader insights that a superficial discussion cannot yield. Our study left us with a great appreciation of the machine and an admiration for its designers.

The insights we have given will be useful to the scientist who would like to obtain maximum performance from a vector machine, to the computer science student, and to the compiler writer. This monograph can also be used to supplement a regular textbook such as Baer [2] or Stone [44] in a graduate or senior level course in computer architecture.

The book begins with an overview of the Cray X-MP system.
Chapter 2 discusses various aspects of control including the instruction cycle, the management of the instruction buffers, and the instruction issue mechanism. The scalar section is examined in Chapter 3 and the addressing mechanism is examined in Chapter 4. Chapter 5 discusses vectorization and chaining. Chapter 6 looks at memory access and conflict resolution. Multitasking and interprocessor communication are introduced in Chapter 7. Appendix A gives a PMS diagram of the Cray X-MP, and Appendix B shows the exchange package for the Cray X-MP. Appendix C lists the Lawrence Livermore loops, a standard benchmark for scientific computing. Appendix D shows a list of sample programs and discusses some of the more subtle aspects of performing accurate instruction timings. Appendix E contains a complete list of Cray assembly language instructions and their timings. Appendix F contains the Users Manual for XMPSIM, a Cray simulator which runs on an IBM PC and is available from the authors.

The authors gratefully acknowledge the support of Cray Research through their University of Texas System Grants Program. Computational support was provided by the University of Texas Center for High Performance Computing.

Several people have read different versions of this book and made suggestions. We would particularly like to thank Nora Fangon, Alyson Thring, Warren Wayne, Luther Keeler, and Neal Wagner for their helpful comments.

1 Overview of the Cray X-MP/Model 24

1.1 Introduction

While the architecture of parallel machines is evolving rapidly, the software tools which would allow the programmer to work in a machine-independent environment have been developing slowly. The intricacies of programming most parallel machines are well beyond the tolerance, if not the capabilities, of the ordinary programmer or scientist [22].
The parallel programming environment has for the most part demanded of its users a detailed knowledge of the machine architecture and the operating system in order to achieve significant gains in performance.

The pipelined vector processor is the first parallel architecture in which there has been a successful marriage of hardware and software to achieve supercomputer performance in an environment in which a programmer can take a somewhat machine-independent view. By a vector processor we mean a processor that can perform an operation on a one-dimensional array of values in a single instruction. A pipelined processor is one in which the functional units are divided into stages and intermediate values are passed from stage to stage as the computation proceeds. When a result leaves one stage, that stage is able to work on the next result in the pipeline, much as is done on an assembly line.

Many applications in the physical sciences map naturally onto the vector processor architecture. A scientist who is willing to settle for moderate program performance can usually rely on the automatic vectorization facilities of the machine's Fortran compiler to perform this mapping in a relatively efficient way. New preprocessing programs such as Forge [16] and Paraphrase [32, 36] allow experienced users to quickly optimize their code for vector processors.

Even though substantial progress has been made in automatic vectorization, the applications programmer is not completely free to take a machine-independent view of program execution. For the student of computer architecture and the compiler writer, an understanding of the underlying architecture is even more important.

In this monograph we examine the detailed architecture of the Cray X-MP, a pipelined vector processor which has set the standard in the supercomputer arena. We look at the detailed operation of each of the sections of the CPU and examine how the CPU interacts with memory to produce a high performance machine.
An understanding of how the X-MP achieves maximum performance and when it fails due to conflicts can give important insights to those interested in vector processing. The discussion is very detailed, but such detail is unavoidable. Although many of the suggested exercises assume the reader already has a background in computer architecture and operating systems, most of the material does not assume a great deal of previous background.

1.2 The Cray X-MP/Model 24 Architecture

The Cray X-MP is a pipelined vector processor manufactured by Cray Research Inc. There are a number of remarkable features about the architecture of this machine. The Cray X-MP shares many of the characteristics of RISC machines [34] including a load-and-store architecture, a short instruction cycle, and single cycle instructions. The Cray X-MP/Model 24 consists of two identical central processing units (CPUs), each with a 9.5 nanosecond cycle time. The processors share a common main memory and I/O section as shown in Figure 1.1.

[Figure 1.1. Cray X-MP/Model 24 dual processor system. Blocks shown: CPU 0 computation section, communication section, CPU 1 computation section, CPU 0 control section, CPU 1 control section, main memory (4 million words), I/O subsystem.]

The control section for each processor manages the instruction buffers, issues instructions, and controls the flow of information within other sections. There is a programmable clock in the control section which is used by the operating system to generate interrupts. The control section is discussed in Chapter 2.

The Cray X-MP's 9.5 nanosecond cycle time is very fast for the technology used. Most instructions can begin execution in a single machine cycle, and under the appropriate circumstances, can produce results on every machine cycle. The machine was designed with relatively simple control mechanisms in order to keep the instruction cycle short.
While the Cray instruction scheduling mechanism does not give optimal overlap of instructions during execution, the hardware required to produce optimal control is sufficiently complex to require a slower cycle time [40]. This delicate trade-off between the length of the CPU cycle and the amount of work which can be accomplished in each cycle is a key to the Cray X-MP's performance.

Each CPU has its own computation section consisting of registers and functional units. The registers include address registers, scalar registers, and vector registers. There are 13 pipelined functional units dedicated to performing specific integer and floating point operations. The computational sections are discussed in Chapters 3, 4, and 5.

The main memory is shared by the two central processing units. It consists of 4 million words organized into 32 banks. Each word includes 64 data bits and 8 check bits. The check bits allow the system to detect double errors and correct single errors during memory accesses. The memory system is discussed in Chapter 6.

The communication section includes 3 clusters of shared registers and semaphores used to arbitrate interprocessor communication and shared memory. A common real time clock is used to synchronize operations between processors. The communication section is discussed in Chapter 7.

The I/O subsystem is delineated in the PMS diagram of Appendix A. It includes a solid-state storage device (SSD) which has a data transfer rate of 1250 Mbytes per second. There are between two and four I/O processors included in the system. The MIOP (master I/O processor) is required in all configurations. It manages the front-end interfaces and the console. The BIOP is also required. It handles transfers between main memory and the secondary storage devices. An optional DIOP (disk I/O processor) manages additional disk controllers, and an optional XIOP (auxiliary I/O processor) controls block multiplexer channels.
The I/O processors have their own local memories and share a common buffer memory. The I/O subsystem channels to main memory are organized into 4 groups (0, 1, 2, and 3). Each group is scanned or polled by the memory switch once every fourth clock pulse. Within a group the lowest-numbered channel has the highest priority.

EXERCISES

1.1 Discuss possible reasons why the Cray X-MP system does not have a cache.

1.2 The individual CPUs in the Cray X-MP system do not have any local memory. Instead, they share a common main memory. Under what circumstances might local CPU memories be useful?

1.3 The Cray Operating System (COS) does not support virtual memory. Instead, the programmer must use overlays to manage programs which do not fit entirely in memory. Discuss the advantages and disadvantages of this approach in the context of a computationally intensive environment.

2 The Control Section

2.1 The Instruction Cycle

The basic instruction cycle in a standard von Neumann machine consists of the following steps:

1. Fetch the instruction.
2. Increment the program counter.
3. Decode the instruction.
4. Decode and fetch the operands.
5. Execute the instruction.
6. Check for interrupts.
7. Go to 1.

Steps 1-4 are designated as the instruction issue phase of the instruction cycle. Step 5 is called the execution phase. An instruction is usually said to be issued at the point when steps 1-4 are completed and the processor is ready to execute the instruction. However, the exact point of issue depends somewhat on the type of control scheme used.

Various strategies can be used to speed up this cycle. For example, the program counter is usually provided with its own incrementer so that step 2 in the instruction cycle can be overlapped with steps 3 and 4. A significant gain in performance can be achieved by allowing the issue phase of one instruction to overlap with the execution phases of one or more of its predecessors.
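The gain from overlapping the issue phase with the execution of earlier instructions can be illustrated with a small timing model. The sketch below is not taken from the monograph: the function names, the one-cycle issue phase, and the execution latencies are all hypothetical, and the model assumes in-order issue with no dependencies or resource conflicts between instructions.

```python
def sequential_cycles(exec_times, issue_cycles=1):
    """Total cycles when each instruction is issued and fully executed
    before the issue phase of the next instruction begins."""
    return sum(issue_cycles + t for t in exec_times)


def overlapped_cycles(exec_times, issue_cycles=1):
    """Total cycles when the issue phase of the next instruction starts
    as soon as the previous instruction has issued (in order, no holds)."""
    issue_done = 0   # cycle at which the latest instruction finished issuing
    finish = 0       # cycle at which all instructions so far have completed
    for t in exec_times:
        issue_done += issue_cycles              # issue phases run back to back
        finish = max(finish, issue_done + t)    # execution follows its own issue
    return finish


# Hypothetical execution latencies, in clock cycles (not X-MP timings).
exec_times = [3, 6, 2, 4]
print(sequential_cycles(exec_times))   # 19
print(overlapped_cycles(exec_times))   # 8
```

In this idealized case the overlapped schedule is bounded by the issue rate plus the longest outstanding execution, rather than by the sum of all issue and execution times.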
The instruction cycle is handled by the control section of the CPU. The control section issues instructions and supervises the execution of instructions and the flow of information through the CPU. In this chapter we will raise some important control design issues and discuss how these issues were resolved by the designers of the Cray X-MP. Some typical design questions which arise in the basic design of a machine include the following:

1. How many instructions can be issued in a single clock cycle?
2. Can instructions be issued out of order?
3. Should there be instruction buffers? If so, how should they be managed?
4. What are the allowable operands for arithmetic and logical operations?
5. How many instructions can be executed in parallel?
6. How should input and output traffic be controlled on the buses?

2.2 The Instruction Issue Phase

The instruction issue phase includes the instruction fetch and the operand fetch. Since main memory access time is much greater than the CPU cycle time, the instruction fetch is a bottleneck in the instruction issue phase. Most high performance processors prefetch and buffer instructions in fast memory so that the processor will not be delayed by slow main memory accesses. The Cray X-MP handles this problem by providing four large instruction buffers. The detailed management of these buffers is discussed in Section 2.3.

Another bottleneck in the instruction issue phase is the fetching of operands. The Cray X-MP uses a load and store architecture to eliminate this problem. In a load and store architecture, the only memory operations are load a register from memory or store a register value to memory. The functional units (which perform the arithmetic, logical, and other data manipulations) require register operands. In this way, most instructions do not experience a delay due to operand fetches from memory.

The Cray X-MP is designed to issue one instruction on every clock cycle.
Instructions cannot be issued out of order and the operands must be available (except in the case of vector chaining which is discussed in Chapter 5). This relatively simple design keeps the CPU cycle time short. More complex designs allow greater overlap and execution rates, but may require a longer processor cycle due to the greater amount of bookkeeping needed to keep track of instruction dependencies. (See Weiss and Smith [46] for a more detailed discussion of the possible options.)

The hardware for the instruction issue is shown in Figure 2.1. Instructions are either 16 bits (1 parcel) or 32 bits (2 parcels) long. We will first consider the issue of 1-parcel instructions which are in the current instruction buffer. During the instruction fetch, the first parcel of an instruction is transferred from the instruction buffer to the NIP (Next Instruction Parcel Register). This process takes one clock cycle. On the next clock cycle the parcel is moved to the CIP (Current Instruction Parcel Register) where it is decoded. The instruction is held in the CIP until it can be issued, that is, until the processor is ready to execute it.

The designation of exactly when an instruction issues can be a bit tricky, as we shall see in our discussion of certain vector operations in Chapter 5. We will say that the issue takes place when the CIP is freed for the next instruction. A one-parcel instruction will need only one clock cycle in the CIP unless there is a conflict because a previously issued instruction is using resources which will be required during the execution of the current instruction. This situation is called a hold condition. Appendix E gives a list of all of the Cray X-MP instructions, their timings, and possible hold conditions which can delay the issue of each instruction.
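The effect of hold conditions on in-order issue can be sketched with a toy model. Everything in the sketch is an assumption made for illustration: the function name, the resource names, and the reservation times are hypothetical and are not the actual X-MP timings listed in Appendix E. The model only captures the rule stated above, namely that an instruction waits in the CIP until every resource it needs is free, and that the next instruction cannot enter the CIP until the current one has issued.

```python
def issue_times(program):
    """Return the clock cycle at which each instruction issues.

    program: list of (resources, busy_cycles) pairs. `resources` is a set of
    hypothetical resource names (registers, functional units) that the
    instruction needs; `busy_cycles` is how long it keeps them reserved
    after it issues.
    """
    free_at = {}     # resource name -> first cycle at which it is free again
    issued = []
    cip_entry = 0    # cycle in which the current instruction enters the CIP
    for resources, busy_cycles in program:
        # Hold condition: wait in the CIP until every needed resource is free.
        issue = max([cip_entry] + [free_at.get(r, 0) for r in resources])
        for r in resources:
            free_at[r] = issue + busy_cycles
        issued.append(issue)
        # In-order issue: the next parcel reaches the CIP one cycle later.
        cip_entry = issue + 1
    return issued


# Hypothetical four-instruction sequence (names and latencies are made up).
program = [
    ({"S1", "fadd"}, 6),        # issues immediately
    ({"S2", "fmul"}, 7),        # independent, issues on the next cycle
    ({"S1", "S3", "fadd"}, 6),  # held until S1 and fadd are free
    ({"A1"}, 1),                # independent, but must wait behind the hold
]
print(issue_times(program))     # [0, 1, 6, 7]
```

The last instruction needs nothing that is busy, yet it issues only at cycle 7, because a held instruction in the CIP blocks everything behind it. This is the cost of the simple in-order scheme that the text weighs against a shorter cycle time.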