
Computer Science and Technology and their Application

306 Pages·1974·16.447 MB·English


Computer Science and Technology and their Application

ADMINISTRATIVE EDITORS
MARK I. HALPERN
WILLIAM C. McGEE

CONTRIBUTING EDITORS
LOUIS BOLLIET
ANDREI P. ERSHOV
J. P. LASKI

PERGAMON PRESS
OXFORD · NEW YORK · TORONTO · SYDNEY

Pergamon Press Ltd., Headington Hill Hall, Oxford
Pergamon Press Inc., Maxwell House, Fairview Park, Elmsford, New York 10523
Pergamon of Canada Ltd., 207 Queen's Quay West, Toronto 1
Pergamon Press (Aust.) Pty. Ltd., 19a Boundary Street, Rushcutters Bay, N.S.W. 2011, Australia

Copyright © 1974 Pergamon Press Inc.
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of Pergamon Press Ltd.
First edition 1974
Library of Congress Catalog Card No. 60-12884
Printed in Great Britain by A. Wheaton & Co., Exeter
ISBN 0 08 017806 5

A Tutorial on Data-Base Organization
ROBERT W. ENGLES
IBM Corporation, Poughkeepsie, New York

Abstract
The purpose of this report is to clarify certain issues of data-base support. The main issues are data independence, security, integrity, search, and the integrated data base. The first section of the report is an introduction, which includes data-management history, trends, and terminology. The second section presents a theory of operational data based on the notions of entity sets and data maps. The third section is an exposition of data-bank design, emphasizing structure, search, and maintenance. The fourth section shows why data independence is a necessary feature of a viable data base system. The report should not be construed as representing a commitment or intention of IBM. The opinions expressed are personal and do not represent a corporate position. The intent of the report is tutorial, and the viewpoint is that of a systems programmer.
Data independence, Integrated data base, Data integrity, Search, Data management, Systems programming, Data security

Preface
The heart of an information system is its files or data base. The purpose of this report is to clarify certain issues of data-base support. The main issues are data independence, security, integrity, search, and the integrated data base.

The first section is an introduction which includes data-management history, trends, and terminology. The evolution of data-management software is viewed in terms of the growing distinction between file organization (logical structure) and data organization (physical structure). A system structure is described to provide a framework for the definition of data-base concepts and terminology.

The second section presents a theory of operational data based on the notions of entity sets and data maps. Entities are the things about which we record facts. Facts are relationships, and data maps are a means of defining relationships. The section includes an analysis of data maps, data-set organizations, and retrieval requests.

The third section is an exposition of data-bank design, emphasizing structure, search, and maintenance. Hierarchical, multi-list, variable and fixed symbol-list organizations are described and compared. The section includes an analysis of complex structure, indexing techniques, and update problems.

The final section shows why data independence is a necessary feature of viable data base support. The presentation emphasizes the need for a logical data organization against which application programmers can define file organizations, and against which data-base administrators can define data organizations. The notion of the entity record set is suggested as the basis of such a logical data organization.

This report should not be construed as representing a commitment or intention of IBM. The opinions expressed in this report are personal and do not represent a corporate position.
The intent of the report is tutorial and the viewpoint is that of a systems programmer.

This report contains references to a selected bibliography which appears after the last section. A more extensive bibliography of data-base organization will be found in ref. [1].

1. Introduction

1.1. BACKGROUND

Traditionally, file organization is the process or result of relating the stored-data requirements of a particular application to the physical characteristics of a particular type of input/output device. A more current definition would allow for the possibility of multiple applications and a class of I/O devices.

In discussing data independence, it is necessary to distinguish between the logical organization of data appropriate to an application and the physical organization of the data base. We will use the term "file organization" to refer to the structure seen by the application programmer and "data organization" to refer to the actual arrangement of the stored data. It is not possible to use terminology that is consistent with all references. In ref. [19], for example, the terms "data structure" and "storage structure" are used for file organization and data organization, respectively. Our basic concepts and terminology are illustrated in Fig. 1.

Traditionally, a file is defined as a collection of related records, and a record is defined as a collection of related fields. These terms need not be redefined provided it is clear that they refer to application-oriented units of information, not systems-programming-oriented units of stored data. The distinction is fundamental. Indeed, the exact relationship between files and stored data is an essential specification of any data management system. This relationship is used in an inexact way in Fig. 2 to categorize levels of data-management software.

FIG. 1. (Diagram not reproduced. Labels: INFORMATION, File Organization, Data Organization, STORAGE.) Definitions given with the figure:
Information: The meaning assigned to data by known conventions.
Data: Any representations to which meaning may be assigned.
Storage: A device into which data can be inserted, in which it can be retained, and from which it can be retrieved.
Data Organization: The correspondence between the structure of data and the structure of storage.
File Organization: The correspondence between the information structure and the structure of the data.

FIG. 2. (Diagram not reproduced. Past: File, Data. Present: File 1, Data set ... File n, Data set n. Future: Files, Data base.)

The category labeled the "past" is the level of no data management. In this category, software is limited to device handling and file organization is not distinguished from data organization. (The only difference between the file and the data is that many generations of data may exist for the "same" file.) Typically, the data organization is suitable for one type of device and one application program. Furthermore, the programmer's intimate knowledge of the data organization is embedded in both the logic and the instructions of the application program. The result is that the program, the data, and the type of storage device are tightly bound, a situation which makes it difficult to change anything or use the same data for other applications. Any change to the data organization requires rewriting, recompiling, and retesting the programs which use the data. Efficient access to the data is limited to a particular search algorithm; i.e., either sequential or direct on one key. While much of the data might be of interest to another application, it will probably include fields that are not of interest, exclude fields that are of interest, and not be organized in the sequence that represents the relationships of interest to the other application. Given the premise that each application should be optimized, the usual solution to these problems is not to use the same data across applications.
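
The tight binding described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the field names, the two record layouts, and the helper functions are hypothetical and do not come from the report. One program has the old field offsets embedded in its instructions; the other names the field and looks up its location in a separately stated layout.

```python
from typing import Dict, Tuple

# Hypothetical description of a stored record: field name -> (offset, length).
Layout = Dict[str, Tuple[int, int]]

LAYOUT_OLD: Layout = {"part_no": (0, 4), "name": (4, 8), "qty": (12, 4)}
LAYOUT_NEW: Layout = {"qty": (0, 4), "part_no": (4, 4), "name": (8, 8)}  # reorganized


def store(values: Dict[str, str], layout: Layout) -> bytes:
    """Write field values into a fixed-length stored record using `layout`."""
    size = max(offset + length for offset, length in layout.values())
    record = bytearray(b" " * size)
    for field, (offset, length) in layout.items():
        text = values[field]
        if len(text) > length:
            raise ValueError(f"{field!r} does not fit in {length} bytes")
        record[offset:offset + length] = text.ljust(length).encode("ascii")
    return bytes(record)


def quantity_bound(record: bytes) -> str:
    """Tightly bound program: the old offsets are embedded in its instructions."""
    return record[12:16].decode("ascii").strip()


def quantity_by_name(record: bytes, layout: Layout) -> str:
    """Program that names the field and leaves its location to the layout."""
    offset, length = layout["qty"]
    return record[offset:offset + length].decode("ascii").strip()


part = {"part_no": "P1", "name": "bolt", "qty": "40"}
old_record = store(part, LAYOUT_OLD)
new_record = store(part, LAYOUT_NEW)
```

After the reorganization, `quantity_bound(new_record)` no longer returns the quantity and the program would have to be rewritten, whereas `quantity_by_name` gives the same answer for either layout as long as it is handed the matching description.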
The category labeled the "present" includes conventional input/output control systems (IOCS) and software which provides limited data-base support. Even at the IOCS level a distinction exists between file organization and data organization. For example, the application programmer views a particular file as a contiguous collection of fixed-format card images; the data organization differs from the file organization in that the data consists of blocked records stored on multiple volumes. However trivial, this is a form of data independence, since the software-provided blocking or deblocking and end-of-volume processing can be transparent to the application program. Changes to the blocking factor or number of volumes will not require changes to the application program.

Current IBM operating systems and program products provide various degrees of data independence. All are limited. With function as the criterion, we can classify the present support into three levels:

1. The level of input/output control as exemplified by DOS.
2. The level of data-set control as exemplified by OS/360.
3. The level of data-base control as exemplified by IMS/360.

At the level of input/output control, software will control the use of the channels and devices, provide error-correction procedures, label processing, symbolic device addressing, end-of-volume processing, and various access methods. At the level of data-set control, software assumes responsibility for direct-access device storage management and maintaining a catalog of data sets. This level also includes a greater degree of data-set protection and device independence. At the level of data-base control, software provides mechanisms for eliminating redundancy and sharing data across applications. This level includes a greater separation of file organization and data organization in that the file may be a subset of a data base.
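
The IOCS-level example above (card images seen by the program, blocked records on multiple volumes underneath) can be sketched as follows. This is a minimal illustration, not a model of any IBM access method: the 80-byte record length, the function names, and the in-memory lists standing in for volumes are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List

RECORD_LEN = 80  # one card image; fixed length is assumed


def block_records(records: Iterable[bytes], blocking_factor: int) -> List[bytes]:
    """Pack fixed-length records into blocks of `blocking_factor` records."""
    blocks: List[bytes] = []
    current: List[bytes] = []
    for rec in records:
        if len(rec) != RECORD_LEN:
            raise ValueError("record is not a fixed-length card image")
        current.append(rec)
        if len(current) == blocking_factor:
            blocks.append(b"".join(current))
            current = []
    if current:  # short final block
        blocks.append(b"".join(current))
    return blocks


def split_into_volumes(blocks: List[bytes], blocks_per_volume: int) -> List[List[bytes]]:
    """Spread the blocks over as many (simulated) volumes as needed."""
    return [blocks[i:i + blocks_per_volume]
            for i in range(0, len(blocks), blocks_per_volume)]


def read_file(volumes: List[List[bytes]]) -> Iterator[bytes]:
    """Deblock and cross volume boundaries; the caller sees only records."""
    for volume in volumes:  # end-of-volume handling is hidden here
        for block in volume:
            for offset in range(0, len(block), RECORD_LEN):
                yield block[offset:offset + RECORD_LEN]


def application(records: Iterable[bytes]) -> int:
    """Stand-in application program: counts the card images it is given."""
    return sum(1 for _ in records)


cards = [f"CARD{i:04d}".ljust(RECORD_LEN).encode("ascii") for i in range(10)]

# Two different data organizations for the same file.
layout_a = split_into_volumes(block_records(cards, blocking_factor=3), blocks_per_volume=2)
layout_b = split_into_volumes(block_records(cards, blocking_factor=5), blocks_per_volume=1)
```

The `application` function is unchanged whichever layout it reads through `read_file`, which is the limited form of data independence the text describes: the blocking factor and the number of volumes are properties of the data organization only.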
In IMS, for example, a data base consists of one or more data sets and many different files can be defined against the same data base. The records of a file are defined in terms of segments; a segment is a unit of stored data consisting of contiguous fields. During execution, IMS selects those segments required by an application program. Programs are dependent on the composition of the segments they use and the hierarchical relationships among segments; otherwise, changes can be made to the data organization without requiring changes to the application programs [2].

1.2. TRENDS

Viable data-base support must provide further separation of file organization and data organization. In the future it should be possible to define a file in terms of the information required by the application without regard to the organization of the data base.

Defining a file in terms of the required information implies a frame of reference composed of objects and relationships in the real world, as opposed to arrangements of data objects in storage. Limited only by what information is available from the data base, it should be possible to specify a file in terms of a set of objects and what facts are desired about these objects. These facts include relationships to other objects, about which one may want certain facts, etc. It should also be possible to select a subset of objects by presenting facts which characterize these objects. Furthermore, it is necessary to specify the sequence and format in which the information is desired.

These requirements have tremendous implications for search and data-base organization. First, it is necessary to distinguish data access from data organization. In the past, it has been customary to design a data organization specifically for a particular method of access. The various access methods of current operating systems, for example, can be used only with the appropriate data organizations. As Mealy [3] puts it: ". . .
the correspondence between the structure of the data and the structure of storage, we call the data organization. While this enables data access, it is not access. Access is a feature of the processing of the data, not of the data itself or how it is represented; different procedures will, in general, want to access the same data in different ways and orders. The order in which data items are fetched and stored is (or should be) independent of the data organization."

For direct access, it has been customary to identify a particular field as the key and arrange the data to facilitate search on values of that field. With the direct access method, for example, the arrangement of data is determined by the key transformation algorithm. With the indexed sequential access method, the data is arranged in accordance with the collating value of the key field. Either way, efficient access to the data is only possible through a single key field. There have always been requirements for access by more than one key. In the future, every field should be considered a potential key for searching.

If a data base is to represent useful information about the real world, then it must reflect some of the complexity of that world. If we view a data base as a representation of information about sets of objects such as parts, products, warehouses, organizations, people, sales orders, purchase orders, manufacturing orders, etc., it is clear that there are many complex relationships among these sets of objects. The data base must contain representations of the objects, simple facts about the objects, and structural facts; i.e., representations of the relationships among the objects. The structure is obviously a complicated graph. Indeed, the relationships among a single set of objects such as parts can be a complicated graph.
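
The requirement stated above, that every field be considered a potential key, can be illustrated with one common technique: keeping an index on each field in addition to the stored records. The sketch below is an assumption-laden illustration, not a method taken from the report; the part records, field names, and functions are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set

# Hypothetical part records; the fields and values are invented for the example.
PARTS: List[Dict[str, Any]] = [
    {"part_no": "P1", "name": "bolt", "color": "grey", "warehouse": "W1"},
    {"part_no": "P2", "name": "nut", "color": "grey", "warehouse": "W2"},
    {"part_no": "P3", "name": "gear", "color": "black", "warehouse": "W1"},
]

Indexes = Dict[str, Dict[Any, Set[int]]]


def build_indexes(records: List[Dict[str, Any]]) -> Indexes:
    """Build one index per field: field -> value -> positions of matching records."""
    indexes: Indexes = defaultdict(lambda: defaultdict(set))
    for position, record in enumerate(records):
        for field, value in record.items():
            indexes[field][value].add(position)
    return indexes


def search(records: List[Dict[str, Any]], indexes: Indexes, **criteria: Any) -> List[Dict[str, Any]]:
    """Return the records that satisfy every field=value criterion."""
    positions = set(range(len(records)))
    for field, value in criteria.items():
        # Intersect with the index entry; unknown fields or values match nothing.
        positions &= indexes.get(field, {}).get(value, set())
    return [records[position] for position in sorted(positions)]


INDEXES = build_indexes(PARTS)
grey_parts = search(PARTS, INDEXES, color="grey")
grey_in_w1 = search(PARTS, INDEXES, color="grey", warehouse="W1")
```

Here no single field dictates the arrangement of the records; a request on `color`, `warehouse`, or any combination is answered from the indexes. The cost, which the sketch ignores, is that every index must be maintained when the data is updated.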
With the notable exception of problems such as the bill of material explosion/implosion of a product structure parts file, conventional batch applications do not require a nonredundant representation of complex structure. Even when applications and their files are interrelated, the files required by a single application need only represent a subset of the total relationships and these relationships are typically no more complicated than a tree structure. With batch processing methods, the data for the files can always be prepared before the execution of each application program. Selecting, merging, and sorting data between runs removes the need for any one collection of data to have complex structure. In effect, the complex structure of the total relationships is represented by distinct collections of simple structures. Each application program has its own files in the form of separate collections of data. Of course, data then exists in a highly replicated form, causing problems in data integrity or file maintenance. However, in a batch processing environment, this problem can be solved by the judicious scheduling of sequential runs.

In an environment characterized by the random arrival of heterogeneous inputs, batch processing methods no longer apply. If inquiries and transactions are to be handled as they occur, files cannot be prepared before each run and the problem caused by the replication of data cannot be solved by sequential scheduling of runs. It becomes necessary to design data banks that represent a maximum of relationships with a minimum of redundancy. The design process may be viewed as combining the files of all applications and factoring out the common information. The result is called an integrated data base.

The issues of data independence, search, and the integrated data base are highly interrelated.
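
The design process described above, combining the files of all applications and factoring out the common information, can be sketched in a few lines. The two application files, their fields, and the function names are hypothetical and serve only to illustrate the idea; a real integrated data base would also have to represent relationships among different sets of objects.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Two hypothetical application files, each holding its own copy of the part name.
INVENTORY_FILE: List[Record] = [
    {"part_no": "P1", "name": "bolt", "on_hand": 40},
    {"part_no": "P2", "name": "nut", "on_hand": 15},
]
PURCHASING_FILE: List[Record] = [
    {"part_no": "P1", "name": "bolt", "supplier": "S7"},
    {"part_no": "P3", "name": "gear", "supplier": "S2"},
]


def integrate(key: str, *files: List[Record]) -> Dict[Any, Record]:
    """Combine files into one record per object, storing each common fact once."""
    base: Dict[Any, Record] = {}
    for file in files:
        for record in file:
            entity = base.setdefault(record[key], {})
            for field, value in record.items():
                if field in entity and entity[field] != value:
                    # Replicated copies that disagree are an integrity problem.
                    raise ValueError(f"conflicting {field!r} for {record[key]!r}")
                entity[field] = value
    return base


def derive_file(base: Dict[Any, Record], fields: List[str]) -> List[Record]:
    """Recreate an application's file from the integrated data."""
    return [{f: entity[f] for f in fields}
            for entity in base.values()
            if all(f in entity for f in fields)]


data_base = integrate("part_no", INVENTORY_FILE, PURCHASING_FILE)
inventory_view = derive_file(data_base, ["part_no", "name", "on_hand"])
purchasing_view = derive_file(data_base, ["part_no", "name", "supplier"])
```

The name of part P1 is now stored once rather than in both files, and each application's file can still be derived from the combined data, so an update to the name needs to be made in only one place.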
One of the purposes of data independence is to allow an installation to evolve to an integrated data base without becoming bogged down in program maintenance. Data independence implies that application programs reference data by name and search is a system function. In order for a system to perform this function and also to provide data integrity and security, the mapping between file organizations and data organization must be formalized. Knowledge about data, which today is embedded in the procedural steps of a program, must be made explicit at the system level. Data must be consistently defined and centrally controlled. The relationships among data elements must be formally described.

1.3. TERMINOLOGY

The design of effective data-base support must be based on consistent, coherent, and adequate notions about operational data. Today, we do not seem to have any "theory of operational data". However, we believe that such a theory is evolving and this is the subject of the next section of this tutorial. Unfortunately, we must first face up to the problem that exists whenever there is no clear and commonly agreed upon theory: the terminology problem. It is doubtful that all readers of this paper will agree with our use of the few technical terms introduced so far, which is exactly why it is necessary to define more terminology before proceeding any further.

The trouble with terminology, as with data, is that meaning is a function of context. To provide context for our definitions, it is necessary to present a system structure. This is not a design, but merely a conceptual framework within which we can talk about data. We assume a multitasking system with a mixture of batch, transaction, and interactive processing. Of course, the system is oriented to on-line use of a common data base. As with any computer-based system, the major components of the system are hardware, people, rules, programs, and data.
The major classes of data and the relationship of programs and data are shown in Fig. 3. The terminal management function includes the care and feeding of the lines, message queueing and routing, and control of local and remote terminal devices. The processing programs are responsible for the interpretation of their input data, requesting file update or retrieval, and the composition of output data. The data-base management function includes the interpretation of data-independent file requests, search, and control of the creation, maintenance, and use of the data base.

A data base is defined as the total collection of stored, operational data used in the application systems of a particular enterprise. Operational data represents certain information about entities of concern to the enterprise; at any one time this data is distinct from input data, output data, programs, and other types of data. The information content of a data base includes data stored on volumes of secondary storage and data derivable from the stored data by functions from the program base. Output data includes answers and reports which are a function of the information content of the data base. Input data includes representations of questions, procedures, and transactions. Transaction records may become part of the data base or cause changes to
