Morgan Kaufmann Publishers is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

Copyright © 2009 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. All trademarks that appear or are otherwise referred to in this work belong to their respective owners. Neither Morgan Kaufmann Publishers nor the authors and other contributors of this work have any relationship or affiliation with such trademark owners nor do such trademark owners confirm, endorse or approve the contents of this work. Readers, however, should contact the appropriate companies for more information regarding trademarks and any related registrations.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: [email protected]. You may also complete your request online via the Elsevier homepage (http://www.elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Application submitted.

ISBN: 978-0-12-374720-4

For information on all Morgan Kaufmann publications, visit our Web site at: www.mkp.com or www.books.elsevier.com

Printed in the United States of America.
08 09 10 11 12 5 4 3 2 1

I dedicate this book to my grandchildren: Gavin, Harper, and Dru. And grand they are!

Preface

This book is about archiving database data. It is not a general purpose book about archiving. It is also not a book about archiving any or all kinds of data. It is about archiving data in traditional IT databases (e.g., relational databases, hierarchical databases, and structured data in computer files).
Examples of source data for archiving are data in DB2 DBMS systems on IBM mainframe systems, data stored in VSAM KSDS databases on IBM mainframes, and data stored in Oracle DBMS systems on Unix servers. It does not include data in spreadsheets or smaller relational systems such as Microsoft Access databases on desktop computers.

This perspective might seem narrow; however, this is a critically important emerging topic in IT departments worldwide. Actually, I should say that the topic has emerged. The need to archive database data is showing up everywhere, although the database tools industry has been slow to provide tools, methodologies, and services to effectively accomplish it. Most IT departments know that they need to have an effective database archiving practice and are trying to figure out how to build one.

The data that needs to be archived from databases is generally the most important business data enterprises keep. It must be retained under very strict rules as specified by laws and common sense. These are not small amounts of data. The archival stores will eventually reach data volumes unheard of even by today’s standards.

Archiving is becoming a much more important topic because it is a part of a major shift in thinking about data that is occurring throughout the United States and most of the rest of the world. This shift demands that enterprises manage their data better than they have in the past and manage it in a way that serves constituents other than the operating departments of an enterprise. The constituents I am talking about include auditors (both internal and external), government investigators, customers, suppliers, citizens’ groups, and plaintiff lawyers for lawsuits filed against corporations. Data is becoming more public in that it reflects a company’s activities and performance. As such, it must stand up to rigid scrutiny and be defended as accurate and authentic.
The constituent base also includes businesspeople within the enterprise who are not trying to assert a wrongdoing or to defend against one but rather who legitimately want to look at historical data to achieve some business goal. Although some will say that this is the role of a data warehouse and business intelligence, the reality is that data warehouse data stores generally do not go back more than two or three years. Sometimes you need to reach back farther, and thus the database archive becomes the logical source.

Archiving data is one facet of this massive change in thinking about managing data. Other parts are database security, database access auditing, data privacy protection, data quality, and data clarity.

Viewed from the highest level, the principles of archiving database data are not particularly different from those of archiving anything else. Viewed from a low level, however, the details of database archiving are very different from those of archiving other types of data. This book drills down into the topics that a database archivist needs to know and master to be effective.

Since there has been no need or attempt to archive database data until recently, there’s no large body of experts who can guide an IT department through the process of building an effective database archiving practice. In fact, for most IT departments, there are no experts sitting around and little or no organized practice. Everyone is starting on a new venture that will mature into a standard practice in a few years. For now, on your journey you will undoubtedly encounter many stumbling blocks and learning experiences.

In addition to a lack of experts, there is a lack of educational material that can be used to understand how to become a database archivist. Hopefully this book will reduce that shortcoming.
HOW THIS BOOK IS ORGANIZED

Before getting too deep into database archiving, it is helpful to begin by looking at the generic process of archiving to determine some basic principles that apply to archiving just about anything. Understanding these concepts will be helpful in later grasping the details of database archiving.

From the basics, the book moves on to discussing how a database archiving project comes into existence, gets organized, and acquires basic goals and policies. This is an area that is poorly understood and yet crucial to the success of database archiving projects.

This discussion is followed by a detailed treatment of designing a solution for a specific application. All the factors that need to come into play are discussed, as are criteria for selecting one path versus another.

A description of software required to execute a database archiving application follows the discussion of database archiving design. The strengths and weaknesses of various approaches and the features that you should look for are all analyzed.

Administration of the application from day to day over many years is then discussed, followed by some additional topics on the fringe of database archiving. Topics include archiving data that is not business critical, the role of the archive in e-discovery, and other ways to view the execution environment.

ORGANIZATIONS THAT ARE LIKELY TO NEED DATABASE ARCHIVING

This book is about database archiving for large database applications. The types of organizations that would benefit from building a database archiving practice are any that have long-term retention requirements and lots of data. This includes most public companies and those that are private but that work in industries that require retention of data (such as medical, insurance, or banking fields). It also includes educational and government organizations. The book refers to all organizations as enterprises.
WHO THIS BOOK IS FOR

This book is intended for the archive practitioner, referred to here as a database archive analyst or simply archive analyst. The book assumes that a person has adopted this position as a full-time job and possibly as a new career path. I say new because this topic is not taught in universities, nor have many IT departments been practicing it until recently. Database archiving will become a standard business activity in the next few years, and all IT departments will need experts on the topic to be effective. This generates the need for a new profession, just as data modeling, data warehousing, data quality, and data security have spawned specialist data management positions in the last decade. Those who enter into this field will find it a challenging, complex, and rewarding experience. If you become an expert in database archiving, you will find a large demand for your services.

After reading this book, the database archivist will be armed with a complete understanding of archiving concepts and will be conversant with the major issues that need to be addressed. This knowledge should enable the archivist to establish a database archiving practice, evaluate or develop tools and methodologies, and execute the planning and operational steps for archiving applications.

Other data professionals such as database administrators (DBA), data modelers, application developers, and system analysts will also benefit from the concepts presented here. The requirements for archiving data from databases should be considered when implementing data design, application design, application deployment, and change control. The data archivist is not the only person who will be involved in the process. For example, the database administrator staff will need to understand the impact of the archiving process on their operational systems and the impact that their actions could have on the archiving process.
Likewise, data modelers will need to understand the impacts of their design decisions on archived data, particularly when they are changing existing data structures.

IT management will also benefit from reading this book. It will help them understand the principles involved in database archiving and help them in creating and managing an effective organization to control database archiving. An effective organization archives only what needs to be retained for long periods of time, does not archive too early or too late, and uses storage devices wisely to minimize cost while preserving the integrity and safety of the archived data. If done correctly, archiving can benefit an enterprise and return more value than it costs, thus not becoming simply another expense.

So let’s get started.

Acknowledgments

It took a lot of conversations to accumulate the thoughts and ideas that went into this book. There is little literature on the topic of data archiving and no place to go to get education. I started this journey three years ago with the belief that archiving was going to emerge as an important new component of future data management, and I was right.

I visited a number of IT departments on the way to discuss their needs and understanding of the topic. In the course of doing so, I talked to dozens of IT professionals. I found a strong understanding of the concept and general problems with database archiving. I found little in the way of implemented practices. Those that did exist were primitive and lacking in many ways. Here I want to acknowledge the strong impact of these meetings on my understanding and development of this subject. I cannot list the companies due to confidentiality agreements, but many of you will remember my visits.

I also want to thank the Data Management Association (DAMA) organization for its support of my work. I visited a large number of DAMA regional chapters and gave presentations on the problems involved in managing long-term data retention.
All of these meetings added new information and insights into my knowledge of the topic. In particular I want to thank John Schlei and Peter Aiken for aiding me in getting on calendars and in supporting my quest for knowledge. They are both giants in the data management space. I also want to thank all the DAMA regional chapters I visited for adding me to their agendas and for the lively discussions that took place. I visited so many chapters regarding this topic that DAMA International gave me a Community Service Award in 2007.

The company I work for, NEON Enterprise Software, launched a development project to produce software to be used for database archiving. At the time of launch I was not part of the development organization. In the past three years many people in that company have spent many hours with me discussing detailed points regarding this subject. You learn so much more about a technology when you have to produce a comprehensive solution for it. Those in the company whom I want to single out for their help and education are John Wright, Ken Kornblum, Kevin Pintar, Dave Moore, Whitney Williams, Bill Craig, Rod Brick, Bill Chapin, Bruce Nye, Andrey Suhanov, Robin Reddick, Barbara Green, Jim Middleton, Don Pate, John Lipinski, Don Odom, Bill Baker, and Craig Mullins. Many others also were involved in the project and were helpful. I must also add Barbara Green, Carla Pleggi, and Sam Armstrong, who helped in getting the draft copies ready for the publisher. As technical as I am, I am a cripple when it comes to using technical writing tools. I needed their help.

I especially want to thank John Moores for his encouragement and support on this project.

Finally I want to acknowledge the patience and understanding of my family in putting up with me while I worked on this book. Without their support this book could not have existed.

CHAPTER 1
Database Archiving Overview

Archiving is the process of preserving and protecting artifacts for future use.
These artifacts have lived beyond their useful life and are being kept solely for the purpose of satisfying future historical investigations or curiosities that might or might not occur. An archive is a place where these artifacts are stored for long periods of time. They are retained in case someone will want or need them in the future. They are also kept in a manner so that they can be used in the future.

Archiving has existed in many forms for centuries. For example, the United States government employs a national archivist. The Presidential libraries are archives. Newspapers retain archives of all stories printed, since papers began to be published. Museums are archives of interesting objects from the past. Your local police department has an evidence archive. When a collection of items is placed in the cornerstone of a building during construction in anticipation that someone 100 or more years in the future will uncover it (a time capsule), an archive is being created and the creators are acting as archivists.

An archive is created for a specific purpose: to hold specific objects for future reference in the event that someone needs to look at them. The focus is on the objects that are to be included in the archive. Each archive has a specific purpose and stores a specific object type.

The process of archiving follows a common methodology. No matter what you archive, you should go through the same steps. If you leave out one of the steps, you will probably run into problems later on. This generic methodology is discussed in Chapter 3. However, before we go there, it is important to set the scope of this book.

The discussion that follows segregates data archiving into categories that are useful in understanding where this book fits into the broader archiving requirements. It also establishes some basic definitions and concepts that will be used later as we get deeper into the process.