ebook img

DTIC ADA632254: Fast Crash Recovery in Distributed File Systems PDF

143 Pages·7.7 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC ADA632254: Fast Crash Recovery in Distributed File Systems

Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED JAN 1994 2. REPORT TYPE 00-00-1994 to 00-00-1994 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Fast Crash Recovery in Distributed File Systems 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION University of California at Berkeley,Department of Electrical REPORT NUMBER Engineering and Computer Sciences,Berkeley,CA,94720 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT This thesis presents fast crash recovery: a simple, efficient, and inexpensive method for increasing availability in distributed systems. In fast crash recovery we assume that critical resources will fail, and we do not attempt to mask the failures with redundant hardware or software. Instead, we design the system to recover so quickly that there is little downtime. This approach is intended for environments that can tolerate occasional failures and cannot afford the cost and overhead of redundant resources. In particular, I focus on fast recovery of distributed state. An example of distributed state is the file caching information maintained by servers in most modern file systems. This information describes the state of file caches on client workstations. After a crash, a server must recover this information in order to guarantee the consistency of the caches. Unfortunately, distributed state recovery can be slow and complex. The techniques I have developed reduce state recovery from several minutes to under six seconds for a Sprite file server with 40 clients. This thesis evaluates three distributed state recovery techniques based on their speed, complexity, and performance overhead. In client-driven recovery clients send their state information to the server after a crash. The server uses this information to regenerate its copy of the distributed state. Server-driven recovery is a modification of client-driven recovery that is faster and eliminates cache inconsistencies that can arise during client-driven recovery. The fastest technique is transparent recovery, so-called because client workstations do not communicate with the server during recovery. Instead, the server stores its distributed state in a protected area of its own main memory called the recovery box. The interface to the recovery box helps detect and prevent corruption of this state information. To achieve fast overall recovery times, we must also recover other parts of the system quickly. For example, we can eliminate a lengthy file system consistency check by using a log-structured file system that recovers in seconds. By combining the improvements described in this thesis, a Sprite file server can reboot in under 30 seconds. This is two orders of magnitude faster than most modern file systems recover. In addition to evaluating distributed state recovery techniques, this thesis presents some overall guidelines for designing distributed systems that will recover quickly from crashes. 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE Same as 141 unclassified unclassified unclassified Report (SAR) Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.