Preface

A long, long time ago, Jim Lola approached me to write a book on the TruCluster Server product - I think that I was still young at the time. Now, many grey hairs later, I sit here on my recliner, my feet up, and my trusty (albeit quirky) laptop on my lap, trying to put into words what an adventure it has been to create the book you now hold in your hands. Once upon a time we envisioned a nice, small, compact handbook - as you can see, we have no apparent concept of what small and compact are. So while it may not be compact, it is as complete as we could make it. Jim, Dennis, Greg, Brad, and I attempted to cover as many aspects of the TruCluster Server product as possible while keeping the book under one million pages and not delving too deeply into the nether-workings of the product's implementation.

This book primarily focuses on version 5.1A of the TruCluster Server product, although we have tried to point out differences between V5.1A and the two previous version 5 releases. As a bonus (and due to our tendency to provide more rather than less information), some chapters include version 5.1B information as well.

As the cover indicates, this book reflects the collaborative effort of five authors, so although we aimed for a consistent look and feel, you will no doubt pick up on each author's individual voice and writing style. Jim was the primary author of chapters 1, 4, 5, 10, 11, 25, and appendix B; Dennis took the lead on chapters 2, 16, 18, 19, and 20; Greg was in charge of chapters 21, 22, 26, and appendix A; Brad had primary responsibility for chapter 3; and I wrote chapters 6-9, 12-15, 17, 23, and 24, as well as batting cleanup and pitching in out of the bullpen.

It is our sincere hope that you read this book, find it indispensable, dog-ear many, many pages, and end up buying copies for every member of your family, friends, co-workers, and the neighbor next door. Enjoy!

Scott Fafrak
Chief Cook and Bottle Washer
November 2002

Acknowledgments

Welcome to the part of the book where we get a chance to thank and acknowledge those individuals who gave of their time and energy, willingly and quite often enthusiastically, to make this book so much better than it would have been without them. The behind-the-scenes efforts of the following people will not soon be forgotten.

Thank you so very much for your technical expertise, encouragement, and in some cases, bad jokes: Dick Buttlar, Jan Mark Holzer, Mike Schloss, Ernie Heinrich, Chris Jankowski, Thomas Sjolshagen, Christian Klein, Fred Knight, Tom Smith, Alan Brunelle, Diane Lebel, Roy Takai, Lorrin Lee, Jem Treadwell, Greg Brown, Bruce Lutz, Susan Verhulst, Wayne Cardoza, Maria Maggio, Pelle Wahlstrom, Tim Donar, Dan McGraw, John Williams, Bruce Ellis, Scooter Morris, Laurel Zolfonoon, Tom Ferrin, John Mundt, Sergey Gleizer, Janice Parker, Steve Gonzalez, Lauralee Reinke, Bob Grosso, Alan Rose, Travis Gummels, Susan Rundbaken, and the Systems Operations and the Computing Technologies Groups at Genentech.

Also, to the many who contributed through your encouragement, support, comic relief, poetry, and tolerance during this project: Pam Chester, Cary Cose, Marcie Dark, Nancy King, Evan Lola, Karl Lola, Lauren Lola, Chris Manley, Scott Manley, Tracy McTernan, Cecelia Nichols, Daniel Nichols, Timothy Nichols, Theron Shreve, Robert Trouard Jr., Abigail Yates, Cherith Yates, Gail Yates, Horace Yates, and the #unix-chat gang. Thank you.
And finally, to our amazingly supportive and downright wonderful wives, Kris Fafrak, Kim Lola, Cheryl Dyment, Beth Yates, and Susan Nichols: your patience and encouragement mean more than words can convey.

Scott, Jim, Den, Greg, and Brad

Foreword

In March 2000, those of us on the TruCluster project team were gratified to see our TruCluster Server product receive first place in 3 of 6 categories in D.H. Brown Associates, Inc.'s Competitive Analysis of Cluster Functionality. For Digital Equipment Corporation's engineers in Nashua, New Hampshire; Manalapan, New Jersey; Bellevue, Washington; and Galway, Ireland, taking top score in the categories of Cluster Concurrent Database Access, Cluster High-Availability Administration, and Cluster Single-System Image spelled a satisfying conclusion to more than three years of development. Unique technologies, such as a cluster file system (CFS) that enables a fully shared root file system and single file system namespace, the distributed lock manager (DLM), clusterwide graphic and command-line management interfaces, and the cluster application availability (CAA) failover framework, are integral to the success of the TruCluster technology.

The TruCluster Server product has become the preferred high availability solution for industries that need continuous operation or require large numbers of compute cycles, such as biotech, mobile telephony, and information services. (When the human genome was decoded, TruCluster was there!)

Now, a few years and a couple of corporate acquisitions later, the TruCluster engineers are smarter, dressed more sensibly, and devastatingly attractive. Significant portions of the TruCluster technology have made their way into the Oracle 9i Real Application Cluster (RAC) product. We are now busy sustaining and improving TruCluster technology on HP's Tru64 UNIX/Alpha platform and porting it to the HP-UX/Itanium platform. (Coincidentally, in the same D.H. Brown analysis, HP's cluster product, MC/ServiceGuard, took top spot in two of the remaining three categories. Coupled with the technology's VAXcluster heritage, the new HP-UX cluster product will have strong bloodlines.)

Scott, Jim, Greg, Brad, and Dennis have collectively logged over three hundred thousand hours of cluster time and, although not devastatingly attractive, they are decidedly not unattractive. Each of them has woken up in the middle of the night in recent weeks, reciting the cluster boot messages in exact order in their entirety. Seriously, these individuals have been on the front lines representing TruCluster products to customers for years, relaying ideas, as well as complaints, to engineering. They've assembled the tools and solutions they describe in this book while on active duty helping real customers configure and maintain large cluster configurations. They personify the care, commitment, and expertise that the TruCluster project team has put into the product and are trustworthy guides to the technology.

Dick Buttlar
Senior Member of Technical Staff
Hewlett-Packard Company, Enterprise Systems Group, Business Critical Systems

Introduction

Imagine that you are the Data Center Manager or the Systems Manager for your company. Okay, quit imagining, you probably are, or may soon be, if you've already purchased this book. Every day, you are faced with challenges. One of the biggest challenges is keeping all your company's corporate critical applications available to your users twenty-four hours per day, seven days per week, and 365 days per year.
Another challenge is ensuring that all your systems are performing optimally, all the time, and will scale as your company grows. Finally, all of your systems have to be easy to manage by your existing staff - you know, do more with less. This scenario probably sounds like someone higher up in the corporate "food chain" is saying, "I want my cake and I want to eat it too, and by the way, it must taste marvelous!" But that is the reality we face in a world where timely information is the key to corporate success. While there are no guarantees that you will be the next CIO, the information to be gained by reading this book should go a long way in furthering your understanding of how to use the TruCluster Server product - and clustering technology in general - to help you meet your primary challenges. Who knows, implementing TruCluster Server may even get you that well-deserved promotion.

1.1 What is a Cluster?

If this were a science fiction novel, then when we refer to clusters, we would be referring to stars. If this were a book on wine, then we would be referring to grapes. The term clusters has been bandied about and used to mean many things in Information Technology. With this in mind, let's define what clusters are.

According to one definition in the Merriam-Webster Dictionary (the online version, of course), a cluster is "a number of similar individuals that occur together." Well, that's not quite right in terms of what a computer cluster is, but it's a good start. For a more precise definition of what a cluster is, try this: "A cluster is a type of parallel or distributed computer system that forms, to varying degrees, a single, unified resource composed of several interconnected computers. Each interconnected computer has one or more processors, I/O capabilities, an operating-system kernel, and memory."[1]

What differentiates clustering from distributed computing is that with a cluster, a number of similar computers work together and form a cadre. This unifying relationship becomes the basis for providing the primary themes in clustering: increased application availability (or high availability), load balancing for scalability and performance, and ease of manageability. This book provides a detailed description of how to create real, single-system image clusters from individual UNIX servers using Compaq's Tru64 UNIX operating system and TruCluster Server software.

1.2 Overview of UNIX Cluster Types

There are three basic types of UNIX clusters: the Failover Cluster, the Single System Image (SSI) Application Cluster, and the Single System Image (SSI) Systems Cluster. There would be four different types of UNIX clusters if you count Linux or Beowulf clustering, but the discussion in this book is limited to the first three types listed above.

1.2.1 Failover Cluster

The Failover Cluster is currently the most common form of UNIX clustering. In its many incarnations and flavors, it is available from most major computer hardware and system software vendors. While the main purpose of the failover cluster is high availability of applications, it is generally considered the most difficult type to configure and manage because of the customization required of application failover scripts. From a hardware perspective, a failover cluster usually has some kind of interconnect between cluster nodes, access to a common disk or storage subsystem from each node, and a network failover capability.
Failover of applications is accomplished through scripts that start and stop the applications during cluster node failure and recovery (a minimal, hypothetical sketch of such a script appears later in section 1.2). Even though each node of a failover cluster is closely coupled through hardware, for the most part it is considered "shared-nothing" from a systems standpoint. Each cluster node must have its own copy of the operating system, and there can be no simultaneous access of disks or of memory between cluster nodes. The difficulty in developing good, robust application failover scripts usually lies in the timing and synchronization of application startup or failover with the accessibility of the common disk subsystem.

[1] "Clusters: Moving Beyond Failover," by Bruce Walker, August 1998, UNIX Review.com.

1.2.2 Single System Image (SSI) Application Cluster

The principal difference between a SSI application cluster and a failover cluster is the application software. The application software must not only be "cluster-aware" but also "parallelized" to operate on each node of the cluster at the same time. These multiple components of the application are presented as one application to the users and the application administrator. The most well-known application providing a single view of the application and its data is Oracle Parallel Server (OPS) from the Oracle Corporation. For the most part, a SSI application cluster usually consists of the application, like OPS, and a fully configured failover cluster. All nodes of the failover cluster would run the application software, and failover scripts would control the actual failover of the application software.

1.2.3 Single System Image (SSI) Systems Cluster

It is generally accepted that adding more Single System Image features to a cluster increases its availability, performance and scalability, and manageability. However, there has been a great deal of disagreement among computer hardware and systems manufacturers regarding which SSI features constitute a full SSI systems cluster. Not surprisingly, each manufacturer believes they are correct in how they define what constitutes a SSI systems cluster, regardless of what the customers think. From a hardware perspective, again, there is little or no difference between a SSI systems cluster and a failover cluster. The real differences come in the software. Features that are essential to any SSI systems cluster include the following:

• SSI device access - a common view of and access to all storage devices.
• A cluster file system - a common view of and access to the entire file system hierarchy.
• A cluster alias or cluster Internet Protocol (IP) addressing - clients view the cluster as one system.
• SSI systems management - the cluster is managed like a single system.

As these features provide the core functionality of any SSI cluster, they must be available prior to the addition of any other SSI-related feature(s). Additional SSI systems cluster features are:

• Batch-load leveling - allows certain processes to run on the least loaded cluster node.
• SSI interprocess communications (IPC) - allows for a single namespace and the sharing of standard IPC capabilities like pipes, semaphores, and shared memory.
• SSI process management - allows for a single namespace for processes.
• Dynamic load balancing - along with SSI process management, allows for process relocation between cluster nodes, thereby dynamically balancing the load on the cluster as a whole.
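To make the failover scripts mentioned in section 1.2.1 a little more concrete, here is a minimal, hypothetical sketch of the kind of start/stop action script a failover cluster framework might invoke when relocating a service between nodes. The service name, device, mount point, and daemon below are invented for illustration only; real frameworks (including the ASE services and CAA resources described later in this book) define their own script interfaces and entry points.

    #!/bin/sh
    # Hypothetical start/stop action script for a made-up "webapp" service.
    # A failover cluster framework would run it with "start" on the node
    # adopting the service and "stop" on the node releasing it.
    # All names below (device, mount point, daemon) are illustrative only.

    SERVICE_DISK=/dev/disk/dsk10c      # shared storage visible to every node
    SERVICE_MOUNT=/webapp_data         # where the application expects its data
    APP_DAEMON=/usr/local/bin/webappd  # the application being made highly available
    PIDFILE=/var/run/webapp.pid

    case "$1" in
    start)
        # Make the shared storage available on this node before starting the application.
        mount "$SERVICE_DISK" "$SERVICE_MOUNT" || exit 1
        "$APP_DAEMON" -d "$SERVICE_MOUNT" &
        echo $! > "$PIDFILE"
        ;;
    stop)
        # Stop the application first, then release the storage so another node can mount it.
        [ -f "$PIDFILE" ] && kill "`cat $PIDFILE`"
        umount "$SERVICE_MOUNT"
        ;;
    *)
        echo "usage: $0 {start|stop}" >&2
        exit 2
        ;;
    esac
    exit 0

In practice the hard part is not this simple start/stop logic but the timing and synchronization issues noted in section 1.2.1: the shared storage must be completely released by one node before another node mounts it and restarts the application.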
Inclusion of one or all of the additional SSI systems cluster features listed above may or may not determine whether a cluster is a full or a partial SSI systems cluster. Rather than have a computer manufacturer's marketing department dictate what a full or partial SSI systems cluster is, we believe that the definition depends on whether the cluster's functionality meets the requirements that you, the user, have for a full or partial SSI cluster. The three themes to consider for any type of cluster solution are high availability, performance and scalability, and manageability.

1.3 Evolution of TruCluster Server

While we will not delve into the complete history of the clustering of computer systems - we'll leave that for another book - we will provide a brief history of clustering as it pertains to the evolution of the TruCluster Server product.

In 1982, Digital Equipment Corporation (Digital) - or, as many users still fondly remember it, DEC - introduced the first commercially viable cluster: the VAXCluster. What made the VAXCluster such a success was that it was technologically the most complete general-purpose cluster on the market. It was a full-featured implementation of a SSI systems cluster and, as many would say, "very cool stuff." Later, as Digital started producing Alpha AXP based VMS systems, they extended the capabilities of VAXCluster to allow for a heterogeneous mix of Alpha AXP and VAX based systems. The product name was also changed from VAXCluster to OpenVMS Cluster to reflect the heterogeneity of the product and the inclusion of the new POSIX open systems standards into VMS.

In 1994, Digital announced the first commercially available UNIX-based cluster: the DECsafe Available Server Environment (ASE) version 1.0. DECsafe ASE was a failover cluster, but instead of using a cluster interconnect for intra-cluster communications, it used the existing TCP/IP-based network. Access to common storage was the initial paradigm with DECsafe ASE. See Figure 1-1.

Figure 1-1: ASE-style Cluster

Over the next couple of years, DECsafe ASE was improved to support additional Alpha-based systems and to add greater functionality in line with customers' demands and expectations for the product. This also provided an opportunity to create a solid foundation for the next step in the evolution of clustering on Digital UNIX. See Table 1-1, Cluster Chronology.

When Digital shipped the TruCluster Software version 1.0 product in 1996, it introduced MEMORY CHANNEL as the cluster interconnect. This was the next step towards achieving what OpenVMS Clusters already had: the SSI systems cluster paradigm. What made the TruCluster Software version 1.0 product truly unique compared to DECsafe ASE was that it was the first UNIX-based cluster product to include support for cluster-aware applications. This support for cluster-aware applications basically allowed for the creation of SSI application clusters.

Late 1996 saw the creation of a new TruCluster (TCR) product umbrella consisting of three functionally overlapping yet distinct products: TruCluster Available Server (ASE) version 1.4, TruCluster Production Server (PS) version 1.4, and TruCluster Memory Channel Software (MC) version 1.4. ASE was the failover cluster product. PS was the natural extension to ASE and the SSI application cluster product. MC allowed users to write applications to take advantage of the new cluster interconnect - very attractive from the standpoint of high performance technical computing (HPTC).
Of these three TruCluster products, PS and MC required the use of the MEMORY CHANNEL interconnect. The next couple of years brought further evolutionary advances in the ASE, TCR, and MC software to provide support for new Alpha-based server hardware and new customer-centric features like shared tape access, online service[2] modification, and Year 2000 Readiness.

1998 to 1999 was a watershed period in which we saw many things change, yet stay the same. Digital Equipment Corporation was acquired by Compaq Computer Corporation (Compaq), and the product name changed from Digital UNIX to Compaq's Tru64 UNIX. In 1999, Compaq released TCR version 1.6, which offered many enhancements but nothing really new in terms of clustering technology. The enhancements included support for Enhanced Security (C2)[3], NetRAIN[4], NFS over TCP/IP, Switched Fibre Channel, and MEMORY CHANNEL 2.

[2] A "service" in the old ASE/TCR days is similar to a Cluster Application Availability (CAA) resource in TruCluster Server, with the exception that you had to consider both the failover scripts and the associated storage for the service. CAA is covered in Chapters 23 and 24.
[3] C2 is a security level for computer systems and is defined by the U.S. Computer Security Center's "Orange Book."
[4] NetRAIN (Redundant Array of Independent Network interface controllers) is discussed in greater detail in Chapter 9.

Table 1-1: Cluster Chronology

Later that same year, Compaq quietly released TruCluster Server version 5.0 as a limited release to a select group of customers. TruCluster Server version 5.0 was the very first version of TruCluster Server to have SSI systems cluster features. From a UNIX perspective, this version of TruCluster Server was no longer evolutionary but revolutionary! It was revolutionary to be able to write a file from one server to a common cluster file system and then be able to read this same file almost instantaneously from another server.

2000 signaled the release of TruCluster Server version 5.0A and, later, TruCluster Server version 5.1 to customers. This was the first general release of TruCluster Server software that had SSI systems cluster features. As of the release of TruCluster Server version 5.0A, each new version of TruCluster Server software is released together with a new version of Tru64 UNIX. Tru64 UNIX version 5.1A and TruCluster Server version 5.1A were released in the fall of 2001. That about brings us to the present (summer 2002). As of this writing, we expect the release of Tru64 UNIX version 5.1B and TruCluster Server version 5.1B in the fall of 2002.

1.4 What is TruCluster Server?

Now that we know what a cluster is, what is TruCluster Server? TruCluster Server is an amalgam of Tru64 UNIX software, storage devices, cluster interconnects, and two or more AlphaServer systems that operate together as a single virtual system. Each cluster member can share resources, storage, and cluster-wide file systems under a single systems management domain. TruCluster Server versions 5.0A, 5.1, and 5.1A provide the following features:

• Cluster-wide namespace.
• Cluster Communications Interconnect.
• Cluster-wide access to disk and tape storage.
• Distributed Lock Manager (DLM).[5]
• Cluster-wide Logical Storage Manager (LSM).
• Single-systems management.
• Connection Manager.[6]
• Single Security Domain.
• Cluster application availability (CAA).
• Rolling Upgrade/Patch.
• Cluster alias.
• Expanded Process IDs (PIDs).
• Highly available NFS server using cluster alias.

Again, with TruCluster Server, you are creating real single-system image systems clusters from individual Tru64 UNIX servers. We will provide a more in-depth overview of Tru64 UNIX and TruCluster Server in Chapter 2.

[5] Distributed Lock Manager is discussed in detail in Chapter 18.
[6] Connection Manager is discussed in detail in Chapter 17.