Cloud Application Architectures
George Reese

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo

Cloud Application Architectures
by George Reese

Copyright © 2009 George Reese. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected].

Editor: Andy Oram
Production Editor: Sumita Mukherji
Copyeditor: Genevieve d'Entremont
Proofreader: Kiel Van Horn
Indexer: Joe Wizda
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
April 2009: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Cloud Application Architectures and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-0-596-15636-7
[V]
1238076149

CONTENTS

PREFACE

1 CLOUD COMPUTING
  The Cloud
  Cloud Application Architectures
  The Value of Cloud Computing
  Cloud Infrastructure Models
  An Overview of Amazon Web Services

2 AMAZON CLOUD COMPUTING
  Amazon S3
  Amazon EC2

3 BEFORE THE MOVE INTO THE CLOUD
  Know Your Software Licenses
  The Shift to a Cloud Cost Model
  Service Levels for Cloud Applications
  Security
  Disaster Recovery

4 READY FOR THE CLOUD
  Web Application Design
  Machine Image Design
  Privacy Design
  Database Management

5 SECURITY
  Data Security
  Network Security
  Host Security
  Compromise Response

6 DISASTER RECOVERY
  Disaster Recovery Planning
  Disasters in the Cloud
  Disaster Management

7 SCALING A CLOUD INFRASTRUCTURE
  Capacity Planning
  Cloud Scale

A AMAZON WEB SERVICES REFERENCE

B GOGRID
  by Randy Bias

C RACKSPACE
  by Eric Johnson

INDEX

CHAPTER ONE
Cloud Computing

THE HALLMARK OF ANY BUZZWORD is its ability to convey the appearance of meaning without conveying actual meaning. To many people, the term cloud computing has the feel of a buzzword. It’s used in many discordant contexts, often referencing apparently distinct things. In one conversation, people are talking about Google Gmail; in the next, they are talking about Amazon Elastic Compute Cloud (at least it has “cloud” in its name!).

But cloud computing is not a buzzword any more than the term the Web is. Cloud computing is the evolution of a variety of technologies that have come together to alter an organization’s approach to building out an IT infrastructure. As with the Web a little over a decade ago, there is nothing fundamentally new in any of the technologies that make up cloud computing.
Many of the technologies that made up the Web had existed for decades when Netscape came along and made them accessible; similarly, most of the technologies that make up cloud computing have been around for ages. It just took Amazon to make them all accessible to the masses.

The purpose of this book is to empower developers of transactional web applications to leverage cloud infrastructure in the deployment of their applications. This book therefore focuses on the cloud as it relates to clouds such as Amazon EC2, more so than Google Gmail. Nevertheless, we should start things off by setting a common framework for the discussion of cloud computing.

The Cloud

The cloud is not simply the latest fashionable term for the Internet. Though the Internet is a necessary foundation for the cloud, the cloud is something more than the Internet. The cloud is where you go to use technology when you need it, for as long as you need it, and not a minute more. You do not install anything on your desktop, and you do not pay for the technology when you are not using it.

The cloud can be both software and infrastructure. It can be an application you access through the Web or a server that you provision exactly when you need it. Whether a service is software or hardware, the following is a simple test to determine whether that service is a cloud service:

If you can walk into any library or Internet cafe and sit down at any computer without preference for operating system or browser and access a service, that service is cloud-based.

I have defined three criteria I use in discussions on whether a particular service is a cloud service:

• The service is accessible via a web browser (nonproprietary) or web services API.
• Zero capital expenditure is necessary to get started.
• You pay only for what you use as you use it.

I don’t expect those three criteria to end the discussion, but they provide a solid basis for discussion and reflect how I view cloud services in this book.
If you don’t like my boiled-down cloud computing definition, James Governor has an excellent blog entry on “15 Ways to Tell It’s Not Cloud Computing,” at http://www.redmonk.com/jgovernor/2008/03/13/15-ways-to-tell-its-not-cloud-computing.

Software

As I mentioned earlier, cloud services break down into software services and infrastructure services. In terms of maturity, software in the cloud is much more evolved than hardware in the cloud.

Software as a Service (SaaS) is basically a term that refers to software in the cloud. Although not all SaaS systems are cloud systems, most of them are.

SaaS is a web-based software deployment model that makes the software available entirely through a web browser. As a user of SaaS software, you don’t care where the software is hosted, what kind of operating system it uses, or whether it is written in PHP, Java, or .NET. And, above all else, you don’t have to install a single piece of software anywhere.

Gmail, for example, is nothing more than an email program you use in a browser. It provides the same functionality as Apple Mail or Outlook, but without the fat client. Even if your domain does not receive email through Gmail, you can still use Gmail to access your mail.

SalesForce.com is another variant on SaaS. SalesForce.com is an enterprise customer relationship management (CRM) system that enables salespeople to track their prospects and leads, see where those individuals sit in the organization’s sales process, and manage the workflow of sales from first contact through completion of a sale and beyond. As with Gmail, you don’t need any software to access SalesForce.com: point your web browser to the SalesForce.com website, sign up for an account, and get started.

SaaS systems have a few defining characteristics:

Availability via a web browser
SaaS software never requires the installation of software on your laptop or desktop.
You access it through a web browser using open standards or a ubiquitous browser plug-in. Cloud computing and proprietary desktop software simply don’t mix.

On-demand availability
You should not have to go through a sales process to gain access to SaaS-based software. Once you have access, you should be able to go back into the software any time, from anywhere.

Payment terms based on usage
SaaS does not need any infrastructure investment or fancy setup, so you should not have to pay any massive setup fees. You should simply pay for the parts of the service you use as you use them. When you no longer need those services, you simply stop paying.

Minimal IT demands
If you don’t have any servers to buy or any network to build out, why do you need an IT infrastructure? While SaaS systems may require some minimal technical knowledge for their configuration (such as DNS management for Google Apps), this knowledge lies within the realm of the power user and not the seasoned IT administrator.

One feature of some SaaS deployments that I have intentionally omitted is multitenancy. A number of SaaS vendors boast about their multitenancy capabilities, and some even imply that multitenancy is a requirement of any SaaS system.

A multitenant application is server-based software that supports the deployment of multiple clients in a single software instance. This capability has obvious advantages for the SaaS vendor that, in some form, trickle down to the end user:

• Support for more clients on fewer hardware components
• Quicker and simpler rollouts of application updates and security patches
• Architecture that is generally more sound

The ultimate benefit to the end user comes indirectly in the form of lower service fees, quicker access to new functionality, and (sometimes) quicker protection against security holes.
However, because a core principle of cloud computing is a lack of concern for the underlying architecture of the applications you are using, the importance of multitenancy is diminished when looking at things from that perspective.

As we discuss in the next section, virtualization technologies essentially render the architectural advantages of multitenancy moot.

Hardware

In general, hardware in the cloud is conceptually harder for people to accept than software in the cloud. Hardware is something you can touch: you own it; you don’t license it. If your server catches on fire, that disaster matters to you. It’s hard for many people to imagine giving up the ability to touch and own their hardware.

With hardware in the cloud, you request a new “server” when you need it. It is ready as quickly as 10 minutes after your request. When you are done with it, you release it and it disappears back into the cloud. You have no idea what physical server your cloud-based server is running on, and you probably don’t even know its specific geographic location.

THE BARRIER OF OLD EXPECTATIONS

The hardest part for me as a vendor of cloud-based computing services is answering the question, “Where are our servers?” The real answer is, inevitably, “I don’t know. Somewhere on the East Coast of the U.S. or Western Europe,” which makes some customers very uncomfortable. This lack of knowledge of your servers’ location, however, provides an interesting physical security benefit, as it becomes nearly impossible for a motivated attacker to use a physical attack vector to compromise your systems.

The advantages of a cloud infrastructure

Think about all of the things you have to worry about when you own and operate your own servers:

Running out of capacity?
Capacity planning is always important.
When you own your own hardware, however, you have two problems that the cloud simplifies for you: what happens when you are wrong (either overoptimistic or pessimistic), and what happens if you don’t have the expansion capital when the time comes to buy new hardware. When you manage your own infrastructure, you have to cough up a lot of cash for every new Storage Area Network (SAN) or every new server you buy. You also have a significant lead time from the moment you decide to make a purchase to getting it through the procurement process, to taking delivery, and finally to having the system racked, installed, and tested.

What happens when there is a problem?
Sure, any good server has redundancies in place to survive typical hardware problems. Even if you have an extra hard drive on hand when one of the drives in your RAID array fails, someone has to remove the old drive from the server, manage the RMA,* and put the new drive into the server. That takes time and skill, and it all needs to happen in a timely fashion to prevent a complete failure of the server.

What happens when there is a disaster?
If an entire server goes down, unless you are in a high-availability infrastructure, you have a disaster on your hands and your team needs to rush to address the situation. Hopefully, you have solid backups in place and a strong disaster recovery plan to get things operational ASAP. This process is almost certainly manual.

Don’t need that server anymore?
Perhaps your capacity needs are not what they used to be, or perhaps the time has come to decommission a fully depreciated server. What do you do with that old server? Even if you give it away, someone has to take the time to do something with that server. And if the server is not fully depreciated, you are incurring company expenses against a machine that is not doing anything for your business.

What about real estate and electricity?
When you run your own infrastructure (or even if you have a rack at an ISP), you may be paying for real estate and electricity that are largely unused. That’s a very ungreen thing, and it is a huge waste of money.

None of these issues are concerns with a proper cloud infrastructure:

• You add capacity into a cloud infrastructure the minute you need it, and not a moment sooner. You don’t have any capital expense associated with the allocation, so you don’t have to worry about the timing of capacity needs with budget needs. Finally, you can be up and running with new capacity in minutes, and thus look good even when you get caught with your pants down.
• You don’t worry about any of the underlying hardware, ever. You may never even know if the physical server you have been running on fails completely. And, with the right tools, you can automatically recover from the most significant disasters while your team is asleep.
• When you no longer need the same capacity or you need to move to a different virtual hardware configuration, you simply deprovision your server. You do not need to dispose of the asset or worry about its environmental impact.
• You don’t have to pay for a lot of real estate and electricity you never use. Because you are using a fractional portion of a much beefier piece of hardware than you need, you are maximizing the efficiency of the physical space required to support your computing needs. Furthermore, you are not paying for an entire rack of servers with mostly idle CPU cycles consuming electricity.

* Return merchandise authorization. When you need to return a defective part, you generally have to go through some vendor process for returning that part and obtaining a replacement.
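
To make the idle-capacity argument concrete, here is a minimal sketch in Python. It compares the server-hours you pay for when you own enough hardware to cover your peak against the server-hours you would pay for if you provisioned only what you needed. The demand profile (two servers for most of the month, ten during a 50-hour spike) and the 730-hour month are purely hypothetical numbers chosen for illustration; they do not come from any real deployment or price list.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def owned_server_hours(peak_servers):
    """Owned hardware is sized for the peak and runs all month, used or not."""
    return peak_servers * HOURS_PER_MONTH


def cloud_server_hours(demand):
    """Pay-as-you-go: sum only the server-hours actually provisioned.

    demand is a list of (servers_needed, hours) pairs.
    """
    return sum(servers * hours for servers, hours in demand)


# Hypothetical month: 2 servers for 680 hours, 10 servers for a 50-hour spike.
demand = [(2, 680), (10, 50)]

owned = owned_server_hours(max(servers for servers, _ in demand))
cloud = cloud_server_hours(demand)
idle_fraction = 1 - cloud / owned

print(owned, cloud, round(idle_fraction, 2))  # 7300 1860 0.75
```

With this made-up profile, roughly three quarters of the owned capacity sits idle. The real ratio depends entirely on how spiky your demand is.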
Hardware virtualization

Hardware virtualization is the enabling technology behind many of the cloud infrastructure vendors’ offerings, including Amazon Web Services (AWS).† If you own a Mac and run Windows or Linux inside Parallels or Fusion, you are using a virtualization technology similar to those that support cloud computing. Through virtualization, an IT admin can partition a single physical server into any number of virtual servers running their own operating systems in their allocated memory, CPU, and disk footprints. Some virtualization technologies even enable you to move one running instance of a virtual server from one physical server to another. From the perspective of any user or application on the virtual server, no indication exists to suggest the server is not a real, physical server.

A number of virtualization technologies on the market take different approaches to the problem of virtualization. The Amazon solution is an extension of the popular open source virtualization system called Xen. Xen provides a hypervisor layer on which one or more guest operating systems operate. The hypervisor creates a hardware abstraction that enables the operating systems to share the resources of the physical server without being able to directly access those resources or their use by another guest operating system.

A common knock against virtualization, especially among those who have experienced it in desktop software, is that virtualized systems take a significant performance penalty. This attack on virtualization generally is not relevant in the cloud world for a few reasons:

• The degraded performance of your cloud vendor’s hardware is probably better than the optimal performance of your commodity server.
• Enterprise virtualization technologies such as Xen and VMware use paravirtualization as well as the hardware-assisted virtualization capabilities of a variety of CPU manufacturers to achieve near-native performance.
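
The partitioning idea described above can be sketched as a simple bookkeeping model. The Python below is a toy illustration only: the PhysicalHost and VirtualServer classes, their fields, and the sizes used are hypothetical names and numbers, and this is not how Xen or any other hypervisor is actually programmed. It shows just the accounting involved: each guest gets its own slice of CPU, memory, and disk, and in this simplified model the slices cannot add up to more than the physical server has (real systems may overcommit some resources).

```python
from dataclasses import dataclass, field


@dataclass
class VirtualServer:
    """Hypothetical record of one guest and its allocated footprint."""
    name: str
    cpu_cores: int
    memory_gb: int
    disk_gb: int


@dataclass
class PhysicalHost:
    """Hypothetical physical server that is carved up into guests."""
    cpu_cores: int
    memory_gb: int
    disk_gb: int
    guests: list = field(default_factory=list)

    def free(self):
        """Return the (cores, memory, disk) not yet allocated to any guest."""
        return (
            self.cpu_cores - sum(g.cpu_cores for g in self.guests),
            self.memory_gb - sum(g.memory_gb for g in self.guests),
            self.disk_gb - sum(g.disk_gb for g in self.guests),
        )

    def provision(self, name, cpu_cores, memory_gb, disk_gb):
        """Allocate a new guest, refusing requests the host cannot fit."""
        free_cpu, free_mem, free_disk = self.free()
        if cpu_cores > free_cpu or memory_gb > free_mem or disk_gb > free_disk:
            raise ValueError(f"host cannot fit guest {name!r}")
        guest = VirtualServer(name, cpu_cores, memory_gb, disk_gb)
        self.guests.append(guest)
        return guest

    def release(self, name):
        """Deprovision a guest; its resources return to the free pool."""
        self.guests = [g for g in self.guests if g.name != name]


host = PhysicalHost(cpu_cores=8, memory_gb=32, disk_gb=1000)
host.provision("web-1", cpu_cores=2, memory_gb=4, disk_gb=100)
host.provision("db-1", cpu_cores=4, memory_gb=16, disk_gb=500)
print(host.free())  # (2, 12, 400)

host.release("web-1")
print(host.free())  # (4, 16, 500)
```

Releasing a guest is the model's version of deprovisioning: the capacity goes straight back into the pool for the next request.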
† Other approaches to cloud infrastructure exist, including physical hardware on-demand through companies such as AppNexus and NewClouds. In addition, providers such as GoGrid (summarized in Appendix B) offer hybrid solutions.

Cloud storage

Abstracting your hardware in the cloud is not simply about replacing servers with virtualization. It’s also about replacing your physical storage systems. Cloud storage enables you to “throw” data into the cloud without worrying about how it is stored or about backing it up. When you need it again, you simply reach into the cloud and grab it. You don’t know how it is stored, where it is stored, or what has happened to all the pieces of hardware between the time you put it in the cloud and the time you retrieved it.

As with the other elements of cloud computing, there are a number of approaches to cloud storage on the market. In general, they involve breaking your data into small chunks and storing that data across multiple servers with fancy checksums so that the data can be retrieved