Preface

OpenVMS was developed in the 1970s to run on Digital Equipment Corporation's 32-bit VAX architecture. In the early 1990s Digital developed the 64-bit Alpha RISC architecture, and the OpenVMS code base was ported to Alpha. The book OpenVMS AXP Internals and Data Structures: Version 1.5 (1994) describes key executive components of an early version of OpenVMS Alpha.

The present book describes the memory management subsystem of OpenVMS Alpha Version 7.3 and the system services that create, control, and delete virtual address space and sections. It emphasizes system data structures and their manipulation by paging and swapping routines. It also describes management of dynamic memory, such as nonpaged pool, and support for nonuniform memory access (NUMA) platforms.

This book updates the memory management part of the Version 1.5 volume, as OpenVMS Alpha Internals: Scheduling and Process Control (1997) updated the scheduling and process control part. Neither update is wholly independent of the Version 1.5 volume. Thus, to expand on topics mentioned here, chapters in this book refer to chapters in any of the three books. References to chapters within this book are by number, for example, Chapter 1. References to chapters in the preceding books are by title, for example, Chapter Synchronization Techniques.

For conceptual background on internals topics not covered in this book, consult the Version 1.5 book and the Scheduling and Process Control volume. Although details such as data structure layouts will likely have changed since previous versions, much of the conceptual foundation of OpenVMS is unchanged.

This book describes some features of the Alpha architecture, but presupposes knowledge of other features. The Alpha Architecture Reference Manual describes the architecture in detail.

Conventions

A number of conventions are used throughout the text and figures of this volume.

During the life of the VAX VMS operating system, the exact form of its name has changed several times: from VAX/VMS Version 1.0 to VAX VMS Version 5.0 to OpenVMS VAX Version 5.5. In describing the evolution of VMS algorithms and discussing the foundation of the OpenVMS Alpha operating system, this book refers to the OpenVMS VAX operating system by whichever name is appropriate for the version referenced.

The term executive refers to those parts of the operating system that are loaded into and that execute from system space. The executive includes the system base images, SYS$BASE_IMAGE.EXE and SYS$PUBLIC_VECTORS.EXE, and a number of other loadable executive images. Because there is no need to distinguish different types of executive image, this book generally shortens the term loadable executive image to executive image.

The terms system and OpenVMS system describe the entire OpenVMS software package, including privileged processes, utilities, and other support software as well as the executive itself. The OpenVMS system consists of many different components, each a different file. Components include the system base images, executive images, device drivers, command language interpreters, and utility programs.

The source modules from which these components are built and their listings are organized by facility. Each facility is a directory on a source or listing medium containing sources and command procedures to build one or more components. The facility DRIVER, for example, contains sources for most of the device drivers.
The facility SYSBOOT contains sources for the secondary bootstrap program, SYSBOOT. The facility SYS contains the sources that make up the base images and many executive images. This book identifies a SYS facility source module only by its file name. It identifies a module from any other facility by facility directory name and file name. For example, [SYSGEN]SYSGEN refers to the source for the system generation utility (SYSGEN).

Closely related routines and modules often have names that differ by only the last few characters. For brevity, this volume refers to two such routines by enclosing the last few characters in square brackets. For example, the name MMG$DALCSTXSCAN[1] refers to routines MMG$DALCSTXSCAN and MMG$DALCSTXSCAN1.

Almost all source modules are built so as to produce object modules and listing files of the same file name as the source module. Thus, a reference in this book to a source module name identifies the file name of the listing file as well. In a case where the two names differ, the text explicitly identifies the name of the listing file. Appendix Use of Listing and Map Files discusses how to locate a module in the source listings.

This book identifies a macro from SYS$LIBRARY:LIB.MLB by only its name, for instance, $PHDDEF. For any other macro, the macro library is specified.

Ported from VAX VMS, many OpenVMS Alpha executive routines have JSB-type entry points. That is, they were originally written to be entered with a VAX JSB instruction rather than a VAX CALLS or CALLG instruction. Typically this was done for performance reasons at a time when most of the executive was written in VAX MACRO and the rest in VAX BLISS. On an Alpha CPU, however, there is little difference in the code generated for a MACRO-32 JSB instruction and for a MACRO-32 CALLS instruction. As part of adding support for high-level language device drivers and other system code, a standard call-type entry point has been added for each JSB-type entry point. An added call-type entry point has the string _STD in its name; for example, MMG$CREPAG_64 and MMG_STD$CREPAG_64 are the two entry points of a per-page routine called by system services such as $CRETVA and $CRETVA_64. New routines typically have only one entry point, a standard call-type entry point with a name that does not include the string _STD.

The unmodified terms process control block and PCB refer to the software data structure used by the executive. The data structure that contains a process's hardware context, the hardware privileged context block (HWPCB), is always called the HWPCB.

The term inner access modes means those access modes with more privilege. The term outer access modes means those with less privilege. Thus, the innermost access mode is kernel and the outermost mode is user.

The term SYSGEN parameter refers to a system cell that can be altered by a system manager to affect system operation. Traditionally, these parameters were altered through either the SYSGEN utility or SYSBOOT, the secondary bootstrap program. Although they can also now be altered through the SYSMAN utility or the AUTOGEN command procedure, this volume continues to use the traditional term SYSGEN parameter. SYSGEN parameters include both dynamic parameters, which can be changed on the running system, and static parameters, whose changes do not take effect until the next system boot. These parameters are referred to by their parameter names rather than by the symbolic names of the global locations where their values are stored.
Appendix SYSGEN Parameters and Their Locations relates parameter names to their corresponding global locations.

The terms byte index, word index, longword index, and quadword index derive from methods of VAX operand access that use context-indexed addressing modes. That is, the index value is multiplied by 1, 2, 4, or 8 (for bytes, words, longwords, or quadwords, respectively) as part of operand evaluation, to calculate the effective address of the operand. Although the Alpha architecture does not include these addressing modes, the concept of context indexing is relevant to various OpenVMS Alpha data structures and tables.

A term in small capital letters refers to the formal name of an argument to an OpenVMS system service, for example, the LOGNAM argument.

A bit field is sometimes described by its starting and ending bit numbers within angle brackets; for example, the interrupt priority level of the processor, in processor status bits <12:8>, is contained in bits 8 through 12.

Unless otherwise noted, numbers in the text are decimal. The term KB refers to a kilobyte, 1,024 bytes; the term MB, to a megabyte, 1,048,576 bytes; the term GB, to a gigabyte, 1,024 MB; and the term TB, to a terabyte, 1,024 GB.

Three conventions are observed for lists:

• In lists like this one, where no order or hierarchy exists, list elements are indicated by leading round bullets. Sublists without hierarchy are indicated by dashes.
• Lists that indicate an ordered set of operations are numbered. Sublists that indicate an ordered set of operations are lettered.
• Numbered lists with the numbers enclosed in circles indicate a correspondence between the list elements and numbered items in a figure or example.

Several conventions are observed for figures. In all diagrams of memory, the lowest virtual address appears at the top of the page and addresses increase toward the bottom of the page. Thus, the direction of stack growth is depicted upward from the bottom of the page. In diagrams that display more detail, such as bytes within longwords, addresses increase from right to left. That is, the lowest addressed byte (or bit) in a longword is on the right-hand side of a figure and the most significant byte (or bit) is on the left-hand side.

Each field in a data structure layout is represented by a rectangle. In many figures, the rectangle contains the last part of the name of the field, excluding the structure name, data type designator, and leading underscore. A rectangle the full width of the diagram generally represents a longword regardless of its depth. A field smaller than a longword is represented in proportion to its size; for example, bytes and words are quarter- and half-width rectangles. A quadword is generally represented by a full-width rectangle with a short horizontal line segment midway down each side. In some figures, a rectangle the full width of the diagram represents a quadword. In these figures, bit position numbers from 0 to 63 above the top rectangle indicate that the rectangle represents a quadword.

For example, Figure 2.5 shows the layout of the fixed part of the process header (PHD). The rectangle labeled SIZE represents the word PHD$W_SIZE; the rectangle labeled WSLIST, the longword PHD$L_WSLIST; and the rectangle labeled NEXT_REGION_ID, the quadword PHD$Q_NEXT_REGION_ID. In almost all data structure figures, the data structure's full-width rectangles represent longwords aligned on longword boundaries.
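To restate the size designators and context indexing in more familiar terms, the sketch below renders the three PHD fields just mentioned as C types and shows how a longword index is scaled when an effective address is formed. It is a hypothetical illustration only: the struct, its field names, and the helper function are inventions for this example, and the real PHD layout is defined by $PHDDEF, not here.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch -- not the $PHDDEF definition.  It shows only how
     * the data type designators map to field widths for the three PHD fields
     * named above; all other PHD fields and their offsets are omitted.
     */
    struct phd_sketch {
        uint16_t size;             /* PHD$W_SIZE: W = word, 16 bits           */
        uint32_t wslist;           /* PHD$L_WSLIST: L = longword, 32 bits     */
        uint64_t next_region_id;   /* PHD$Q_NEXT_REGION_ID: Q = quadword      */
    };

    /* Context indexing: a longword index is implicitly scaled by the element
     * size (4 bytes here) as part of forming the effective address.
     */
    static inline uint32_t longword_index(const uint32_t *table, size_t i)
    {
        return table[i];           /* effective address = table + i * 4       */
    }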
In a few data structures, a horizontal row of boxes represents fields whose sizes do not total a longword. Without this practice, most of the fields in this kind of structure would be split into two part-width rectangles in adjoining rows, because they are unaligned longwords.

Some data structures have alternative definitions for fields or areas within them. A field with multiple names is represented by a box combining the names separated by slash (/) characters. An area with multiple layouts is shown as a rectangle with a dashed line separating the alternative definitions. For example, in Figure 2.18, fields PFN$L_FLINK and PFN$L_SHRCNT are two names for the same field. Figure 2.18 also shows an example of alternative definitions for an area; the quadword at PFN$Q_BAK is also divided into the longwords PFN$L_PHD and PFN$L_COLOR_BLINK.

A data structure field containing the address of another data structure in the same figure is represented by a bullet connected to an arrow pointing to the other structure. Where possible, the arrow points to the rightmost end of the field, that is, to bit 0. A field containing a value used as an index into that or another data structure is represented by an x connected to an arrow pointing to the indexed location.

Two conventions indicate elisions in a data structure layout. A specific amount of space is shown as a rectangle whose sides contain dots. Text within the rectangle indicates the amount of space it represents. Field PHD$Q_PAL_RSVD in Figure 2.5, for example, represents 48 bytes. An indeterminate amount of space, often unnamed, representing omitted and undescribed fields, is indicated by a rectangle whose sides are intersected by short parallel horizontal lines. For example, Figure 2.1, which identifies only the PCB fields related to memory management, contains seven sets of omitted fields among the labeled fields.

In a typical figure that represents a code flow, such as Figure 4.1, time flows downward. Each different environment in which the code executes is represented by a column in the figure. The headings above the columns identify the environment characteristics, for example, "Kernel Thread Context" or "Kernel Mode". A code flow figure represents only the events in the code most relevant to the current discussion. A description of code within a routine begins with the routine's name in boldface type, followed by text lines describing the routine's actions. When one routine calls another, the routine nesting is shown by indents. A lightning bolt represents an exception or interrupt. A diamond represents a branch test. An arrow indicates a transfer of control, typically from one routine to another.

Acknowledgments

In addition to acknowledging the work of the many contributors to OpenVMS AXP Internals and Data Structures: Version 1.5, the precursor of the present series, I would like to acknowledge the contributors to the present volume.

A number of people reviewed substantial portions of the book and made suggestions that improved its quality. I am particularly grateful to John Gillings, Mike Harvey, Andy Kiihnel, Arthur Lampert, Karen Noel, and Guy Peleg. I'd also like to thank Nitin Karkhanis and Bobbi Ketelsen for their review.

I was delighted to work again with Alice Cheyer and Sarah Lemaire. Alice has been editing previous versions of this book since 1987 and shepherding them through production. Her punctilious attention to detail has greatly improved the book's consistency and readability.
She challenges me to do my best work and keeps me honest. Sarah Lemaire contributed immeasurably to the composition, figures, and index of this book. She remained unflappable and cheerful in the face of seemingly endless problems in every aspect of the production.

I am grateful to Bryan Jones, Group Engineering Manager for OpenVMS Engineering, for funding my work on this volume. I would also like to thank my managers for their support: Verell Boaen, Jim Janetos, and Joe Schuster. Carl Gallozzi and Margie Sherlock provided critical supplemental support.

Pam Chester and Theron Shreve guided the book through its publication stages at Digital Press. Alan Rose aptly managed the prepress production; his wry sense of humor lightened the process. Lauralee Reinke capably dealt with font mysteries. Lazarus Guisso greatly improved the format of the book's pages.

Chapter 1
Fundamentals and Overview

One must have a good memory to be able to keep the promises one makes.
Friedrich Wilhelm Nietzsche, Human, All Too Human

Virtual memory support for the OpenVMS Alpha operating system is based upon Alpha architectural features. It is designed to provide

• Maximum compatibility with OpenVMS VAX memory management
• Access to the larger Alpha address space
• Support for memory shared among multiple OpenVMS instances running on a Galaxy platform
• Support for efficient operation of platforms with nonuniform memory access

This chapter describes the Alpha memory management architecture and provides an overview of OpenVMS Alpha memory management. Sections 1.1 to 1.11 describe the fundamental concepts of memory management and the architectural mechanisms that underlie it. Sections 1.12 to 1.14 give an overview of OpenVMS management of virtual and physical memory.

1.1 Overview

Physical memory is the real memory supplied by the hardware. A virtual memory environment supports software that has memory requirements greater than the available physical memory. An individual process can require more physical memory than is available, or the total requirements of multiple processes can exceed available physical memory. A virtual memory system simulates real memory by transparently moving the contents of memory to and from block-addressable mass storage, usually disks.

An Alpha processor and the executive cooperate to support virtual memory. As used here, the term processor includes both the CPU hardware and its privileged architecture library (PALcode) address translation code.

In normal operation, the processor interprets all instruction and operand addresses as virtual addresses (addresses in virtual memory) and translates virtual addresses to physical addresses (addresses in physical memory) as it executes instructions. This execution-time translation capability enables the executive to execute an image in whatever physical memory is available. It also enables the executive and an Alpha processor in combination to restrict access to selected areas of memory, a capability known as memory protection. The term memory management describes not only virtual memory support but also the ways in which the executive exploits this capability.
Memory management is fundamentally concerned with the following issues:

• Movement of code and data between mass storage and physical memory as required to simulate a virtual memory larger than the physical one
• Support of memory areas in which individual processes can run without interference from others, areas in which system code can be shared but not modified by its users, and areas in which application code and data can be shared
• Arbitration among competing uses of physical memory to optimize system operation and allocate memory equitably

1.2 Physical Memory Configurations

On a uniprocessor system, all the physical memory is associated with one CPU. Some of the physical memory is permanently occupied by executive code and data, and the rest is available for system processes and user applications.

A symmetric multiprocessing (SMP) system is a hardware platform with two or more CPUs. Each can access all the physical memory and execute instructions independently of the others. As in a uniprocessor system, some of the physical memory is permanently occupied by a single copy of executive code and data, and the rest is available for other uses. Each CPU executes a different thread of execution, for example, an interrupt service routine or a kernel thread of some process. Executive code allocates physical memory, coordinates scheduling of the CPUs, and, when necessary, synchronizes their operations.

OpenVMS Alpha introduced support for Galaxy systems in Version 7.2. In a Galaxy system, multiple copies of OpenVMS execute within one multiprocessor computer. Each copy is called an instance. The system manager assigns each instance some of the computer's resources, in particular, memory, CPUs, and I/O peripherals. The system manager can reassign resources among the instances as needs change. With OpenVMS Alpha Version 7.3, only CPUs can be reassigned. The term soft partitioning describes this type of software-controlled separation of computing resources. In contrast, hard partitioning is a physical separation of computing resources by hardware-enforced barriers.

An instance can be a uniprocessor or an SMP system. In each instance, some physical memory is occupied by executive code and data, and the rest is available for other uses. Some physical memory is shared among all the instances for Galaxywide system data structures. Applications running on multiple instances can create Galaxywide global sections in shared memory. Executive code synchronizes its own access to executive Galaxywide data and provides mechanisms for applications to synchronize their access to Galaxywide sections.

Figure 1.1 shows a simple representation of a Galaxy platform's CPUs and memory, which have been divided into three instances. Instance 0, for example, is a three-member SMP system. The executive occupies some of its private memory, and the rest is available for other uses. The three instances all share memory for executive data, and applications running on them can share global sections in shared memory.

Figure 1.1 Example Galaxy Configuration
[The figure shows three instances, each with its own CPUs and instance-private memory, together with executive-shared memory and application-shared memory common to all three instances.]

Some newer platforms are made up of hardware components called system building blocks (SBBs). Each SBB can have CPUs, memory, and I/O adapters. The components within an SBB have similar access characteristics.
On AlphaServer GS160 and GS320 systems, for example, the SBB is called a quad building block (QBB). On these systems a CPU can access physical memory in its own QBB more quickly than other memory. This phenomenon is called nonuniform memory access (NUMA). The system manager can configure the CPUs into a single SMP system running one OpenVMS instance or into a Galaxy system running multiple instances. In either case, if a single instance runs on multiple SBBs, the executive differentiates between memory local to an SBB and nonlocal memory to improve performance. For example, if a process is assigned to a particular SBB, the executive attempts to allocate its physical memory from memory local to that SBB. Section 1.7 describes another way in which OpenVMS supports NUMA platforms.

A software grouping of hardware components with similar access characteristics is called a resource affinity domain (RAD). For example, on an AlphaServer GS160 or GS320 system, a RAD corresponds to a QBB.

Figure 1.2 shows the CPUs and memory of an example GS160 configuration. The system manager has configured it as a Galaxy system running four OpenVMS instances. Instances 0 and 1 each correspond to a QBB. Instance 2, however, has CPUs and memory from QBB 2 and QBB 3 and thus makes use of two RADs. Instance 3's CPUs and memory are all from QBB 3.

Figure 1.2 Example GS160 Configuration
[The figure shows four RADs, RAD 0 through RAD 3, and the four instances mapped onto them: Instances 0 and 1 each occupy one RAD, Instance 2 spans RADs 2 and 3, and Instance 3 occupies part of RAD 3.]

1.3 Virtual Memory Concepts

Virtual memory is implemented so that each process has its own address space. Some of the address space is private to a process, and some of it is common to all processes. Executive code and data occupy the common virtual memory, which is called system space.

Virtual memory can be larger than physical memory. Support for virtual memory enables a process to execute an image that only partly resides in physical memory at any given time. Only the portion of virtual address space actually in use need occupy physical memory. This enables the execution of images larger than the available physical memory. It also makes it possible for parts of different processes' images and address spaces to be resident simultaneously even when they are in the same address range. Address references in an image built for a virtual memory system are independent of the physical memory in which the image actually executes.

Physical memory consists of byte-addressable storage locations eight bits (one byte) long. Physical address space is the set of all physical addresses that identify unique physical memory storage locations and I/O space locations. A physical address can be transmitted by the processor over the processor-memory interconnect, typically to a memory controller.

During normal operations, an instruction accesses memory using the virtual address of a byte. The processor translates the virtual address to a physical address using information provided by the operating system. The set of all possible virtual addresses is called virtual memory, or virtual address space.

Virtual address space and physical memory are divided into units called pages. Virtual and physical pages are the same size.
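The paragraphs that follow describe the page as the unit of address translation: only the page-number portion of a virtual address is translated, while the byte-within-page offset passes through unchanged. The minimal sketch below illustrates that decomposition; the 8 KB page size and the sample address are assumptions chosen for illustration only, since the page size is selected by the particular Alpha implementation.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192u    /* assumed page size; implementation-dependent */

    int main(void)
    {
        uint64_t va = 0x000000012A4F6C28u;     /* an arbitrary virtual address */

        uint64_t vpn    = va / PAGE_SIZE;      /* virtual page number          */
        uint64_t offset = va % PAGE_SIZE;      /* byte within the page         */

        /* Translation replaces the virtual page number with a physical page
         * frame number; the byte-within-page offset is carried over unchanged.
         */
        printf("VPN = 0x%llx, offset = 0x%llx\n",
               (unsigned long long)vpn, (unsigned long long)offset);
        return 0;
    }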
Each page is a group of contiguous bytes that starts on an address boundary that is a multiple of the page size in bytes. The page is the unit of address translation. An entire virtual page is always mapped to an entire physical page; addresses within one virtual page translate to addresses within the same physical page. The virtual page is also the unit of memory protection. Each virtual page has protection attributes specifying which access modes can read and write that page.

Memory management is always enabled on an Alpha processor. The CPU hardware treats all instruction-generated addresses as virtual. Note, however, that kernel mode code can access a physical address directly by executing the instruction CALL_PAL STQP or CALL_PAL LDQP.

The CPU hardware attempts to translate virtual addresses to physical addresses using a hardware component called a translation buffer (TB). A TB is a cache of previously translated addresses. Because a TB can be accessed and searched faster than a page table, address translation is first attempted through a TB lookup. If the TB does not include this particular translation, PALcode must access a set of software data structures called page tables to load the translation into the TB.

Each process has its own set of page tables. Page tables provide a complete association of virtual to physical pages. A page table consists of page table entries (PTEs), each of which associates one page of virtual address space with its physical location, either in memory or on a mass storage medium. A PTE contains a bit called the valid bit, which, when set, means that the virtual page currently occupies some page of physical memory. A PTE whose valid bit is set contains the number of the physical page occupied by the virtual page. The physical page number, called a page frame number, consists of all the bits of the physical page's address except for those that specify the byte within the page.

When a reference is made to a virtual address whose PTE valid bit is set, the processor uses the page frame number to transform the virtual address into a physical address. This transformation is called virtual address translation. When a reference is made to a virtual address whose PTE valid bit is clear, the processor cannot perform address translation and instead generates a translation-not-valid exception, more commonly known as a page fault. The page fault exception