Text Encoding Initiative
Background and Context

Edited by

Nancy Ide
Department of Computer Science, Vassar College, Poughkeepsie, NY, USA

and

Jean Véronis
Laboratoire Parole et Langage, CNRS & Université de Provence, Aix-en-Provence, France

Reprinted from Computers and the Humanities, Volume 29, Nos. 1, 2 & 3 (1995), edited by Glyn Holmes
(With the addition of an SGML/TEI Bibliography)

Springer Science+Business Media, B.V.

Library of Congress Cataloging-in-Publication Data

Text encoding initiative: background and contexts / edited by Nancy Ide and Jean Véronis.
p. cm.
"Reprinted from Computers and the humanities 29: 1-3, 1995."
ISBN 978-0-7923-3704-1
ISBN 978-94-011-0325-1 (eBook)
DOI 10.1007/978-94-011-0325-1
1. Text processing (Computer science) 2. Coding theory. I. Ide, Nancy M. II. Véronis, Jean. III. Computers and the humanities.
QA76.9.T48T47 1995
005.7'2--dc20 95-31289

Printed on acid-free paper

All Rights Reserved
© 1995 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1995
Softcover reprint of the hardcover 1st edition 1995

No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.

Table of Contents

CHARLES GOLDFARB / Preface
NANCY IDE and JEAN VÉRONIS / Introduction

PART I: GENERAL TOPICS
NANCY IDE and C.M. SPERBERG-McQUEEN / The Text Encoding Initiative: Its History, Goals, and Future Development
C.M. SPERBERG-McQUEEN and LOU BURNARD / The Design of the TEI Encoding Scheme
LOU BURNARD / What is SGML and How Does It Help?

PART II: DOCUMENT-WIDE ENCODING ISSUES
HARRY GAYLORD / Character Representation
RICHARD GIORDANO / The TEI Header and the Documentation of Electronic Texts
DOMINIC DUNLOP / Practical Considerations in the Use of TEI Headers in Large Corpora

PART III: ENCODING SPECIFIC TEXT TYPES
DAVID CHISHOLM and DAVID ROBEY / Encoding Verse Texts
JOHN LAVAGNINO and ELLI MYLONAS / The Show Must Go On: Problems of Tagging Performance Texts
ROBIN COVER and PETER ROBINSON / Textual Criticism
DANIEL GREENSTEIN and LOU BURNARD / Speaking with One Voice: Encoding Standards and the Prospects for an Integrated Approach to Computing in History
STIG JOHANSSON / The Encoding of Spoken Texts
ALAN MELBY / E-TIF: An Electronic Terminology Interchange Format
NANCY IDE and JEAN VÉRONIS / Encoding Dictionaries

PART IV: SPECIAL ENCODING MECHANISMS
STEVEN J. DeROSE and DAVID DURAND / The TEI Hypertext Guidelines
D. TERENCE LANGENDOEN and GARY F. SIMONS / Rationale for the TEI Recommendations for Feature-Structure Markup
DAVID BARNARD, LOU BURNARD, JEAN-PIERRE GASPART, LYNNE A. PRICE, C.M. SPERBERG-McQUEEN, GIOVANNI BATTISTA VARILE / Hierarchical Encoding of Text: Technical Problems and SGML Solutions

SGML/TEI Bibliography by Robin C. Cover

Computers and the Humanities 29: 1, 1995.

Preface

Charles F. Goldfarb
Saratoga, California

If asked for a sure recipe for chaos I would propose a project in which several thousand impassioned specialists in scores of disciplines from a dozen or more countries would be given five years to produce some 1300 pages of guidelines for representing the information models of their specialties in a rigorous, machine-verifiable notation. Clearly, it would be sociologically and technologically impossible for such a group even to agree on the subject matter of such guidelines, let alone the coding details. But just as clearly as the bumblebee flies despite the laws of aerodynamics, the Text Encoding Initiative has actually succeeded in such an effort.

The TEI Guidelines are extraordinary. Even if they were never adopted they would stand as a significant contribution to scholarship for their detailed analysis of the information sets of a huge range of complex text types. But in fact they have already been implemented, both by scholars for research and interchange and by commercial publishers for the publication of linguistic and humanistic works.

I am delighted that my invention, the Standard Generalized Markup Language, was able to play a role in the TEI's magnificent accomplishment, particularly because almost all of the original applications of SGML were in the commercial and technological realms. It is reasonable, of course, that organizations with massive economic investments in new and changing information should want the benefits of information asset preservation and reuse that SGML offers. It is gratifying that the TEI, representing the guardians of humanity's oldest and most truly valuable information, chose SGML for those same benefits.

The vaunted "information superhighway" would hardly be worth traveling if the landscape were dominated by industrial parks, office buildings, and shopping malls. Thanks to the Text Encoding Initiative, there will be museums, libraries, theaters, and universities as well.

Computers and the Humanities 29: 3-4, 1995.

Introduction *

Nancy Ide
Department of Computer Science, Vassar College, Poughkeepsie, New York 12601, U.S.A.
e-mail: [email protected]

and

Jean Véronis
Laboratoire Parole et Langage, CNRS & Université de Provence, 29, Avenue Robert Schuman,
13621 Aix-en-Provence Cedex 1, France
e-mail: [email protected]

* Nancy Ide is Associate Professor and chair of Computer Science at Vassar College, and Visiting Researcher at CNRS. She is president of the Association for Computers and the Humanities and chair of the Steering Committee of the Text Encoding Initiative. Jean Véronis is Maître de Conférences in Computer Science at the Université de Provence and head of the Natural Language Processing Group of Laboratoire Parole et Langage.

The Text Encoding Initiative (TEI) Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994), known familiarly as TEI P3 (i.e., "TEI Proposal number 3") or just "the Guidelines", appeared in May, 1994. The Guidelines are the result of over six years' work by dozens of scholars from all over the world who were involved in TEI working groups, providing their conclusions concerning the optimal way to consistently and comprehensively encode a vast range of text types and features. As such, TEI P3 represents a pioneer effort in an area where only occasional and isolated attempts had been made before, and will certainly serve as the primary basis for encoding texts in electronic form for the foreseeable future.

As a reference manual, TEI P3 necessarily contains only the final result of the sometimes lengthy and occasionally heated discussions within the TEI working groups that produced the final recommendations. Contextual information, lengthy explanations, and extensive exemplary materials could obviously not be included in the Guidelines, which comprise two volumes spanning over 1300 pages. Most importantly, many of the challenges of text encoding that the TEI working groups had to face are not obvious in what was eventually included in TEI P3. This collection of papers is intended to fill this gap, by providing a background and context for the contents of TEI P3.

The idea for this collection came primarily as a result of our experience in one of the working groups in which we participated, concerned with encoding dictionaries. At the end of our work it was clear to us that much of the intellectual process that went into the final decisions is not apparent in reading the chapter of the Guidelines concerned with dictionaries alone, or, for that matter, any other part of TEI P3. In fact, within the dictionary working group there was considerable debate, and a number of extremely interesting ideas and insights for a dictionary encoding scheme were proposed. Obviously, many of these ideas were ultimately not included in the final scheme; but readers of TEI P3 alone will not be aware of these ideas nor, more importantly, the reasons why some were rejected in the final recommendations. However, this information is valuable for users of the dictionary encoding guidelines, both to answer their questions about the choices we eventually made and to highlight the fundamental problems for encoding this text type. We felt strongly that readers of the dictionary chapter of TEI P3, and in fact any chapter of TEI P3, would benefit from a discussion of the rationale behind what the working groups finally produced. We therefore put together this collection of papers, comprising contributions from members of virtually all TEI committees and working groups.

There is an even more compelling reason for collecting this series of papers. The TEI is a pioneering effort; the TEI working groups were the first to comprehensively address the substantial intellectual problem of representing textual data in electronic form. They were faced with a vast array of new problems: it became immediately apparent that the development of a text encoding scheme demanded much more than assigning tag names to features, and included looking at the conceptual structure of texts and determining the commonalities across different text types. It also demanded finding the most consistent and effective ways to encode texts using the Standard Generalized Markup Language (SGML), which provides only the fundamental machinery for marking texts and provides for many alternative means to encode texts. Therefore, the work of participants in the TEI not only involved consideration of problems of text encoding that are likely to be with us for decades to come, but also required the development of a methodology, from scratch, for approaching these problems. These pioneering efforts, while likely to be refined and extended, should not be lost; they provide the intellectual basis upon which text encoding practices will build in the future. This collection is therefore also an attempt to document the course of these efforts.

The papers in this collection are grouped in four sections. The first section provides background material for the remainder of the papers, by outlining the motivation for establishing the TEI together with the development of its overall goals and philosophy, giving a summary of the overall design of the TEI scheme, and providing a brief tutorial on SGML. The authors of papers in the remaining sections of this collection have assumed a basic familiarity with TEI P3 and SGML that is, for the most part, provided by the papers in section one. The second section treats encoding problems present in all text types, particularly character representation and bibliographic documentation of electronic materials. The third section of the collection provides a survey of the encoding concerns for a range of important specific text types, including all of those treated in TEI P3. This section not only looks at specific text types in isolation, but also serves to highlight many of the fundamental encoding problems across text types - such as recoverability of original source materials, the representation of different "views" of a text, etc. - and demonstrates how they were approached by different groups with often very different problems and concerns. The final section is concerned with mechanisms for a few special encoding problems not particular to any text type, including cross-reference and linkage, alignment of non-contiguous elements, and dealing with text elements which do not conform to SGML's basic requirement for an entirely hierarchical, nested internal structure. Many of these papers have repercussions for SGML use in general, and constitute valuable contributions to some of the more technical discussions concerned with problems of text encoding.

This collection is the result of much effort and enthusiasm by everyone over the past two years, but most of all the authors, who worked hard to fit their papers into the larger whole. Our thanks to them and to Glyn Holmes, the editor of CHum, for their efforts. We hope that this collection will prove to be as valuable a companion to the TEI Guidelines as we envisioned at the outset, and that it will contribute to the on-going development of text encoding research and practice.

Reference

Sperberg-McQueen, C.M. and L. Burnard, eds. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford, 1994.

Computers and the Humanities 29: 5-15, 1995.
© 1995 Kluwer Academic Publishers.

The TEI: History, Goals, and Future *

Nancy M.
Ide
Department of Computer Science, Vassar College, Poughkeepsie, New York 12601, U.S.A.
e-mail: [email protected]

and

C.M. Sperberg-McQueen
Computer Center, University of Illinois at Chicago, Chicago, Illinois 60680, U.S.A.
e-mail: [email protected]

* Nancy Ide is Associate Professor and chair of Computer Science at Vassar College, and Visiting Researcher at CNRS. She is president of the Association for Computers and the Humanities and chair of the Steering Committee of the Text Encoding Initiative. C. M. Sperberg-McQueen is a Senior Research Programmer at the academic computer center of the University of Illinois at Chicago; his interests include medieval Germanic languages and literatures and the theory of electronic text markup. Since 1988 he has been editor in chief of the ACH/ACL/ALLC Text Encoding Initiative.

Key words: TEI, electronic texts, text encoding, encoding standards, SGML, tagging

Abstract

This paper traces the history of the Text Encoding Initiative, through the Vassar Conference and the Poughkeepsie Principles to the publication, in May 1994, of the Guidelines for Electronic Text Encoding and Interchange. The authors explain the types of questions that were raised, the attempts made to resolve them, the TEI project's aims, the general organization of the TEI committees, and they discuss the project's future.

1. Overview

Before they can be studied with the help of computers, texts must be encoded in computer-readable form. Standard data processing practice provides convenient solutions for basic text representation problems, but many texts of interest to scholarly research present difficulties not resolved by industrial standards. Therefore, over the years scholars have developed a variety of methods for representing special characters, encoding logical divisions of a text, representing analytic or interpretative information, and reducing text-critical apparatus to a single linear sequence. Because of the lack of a unified, standard format, scores of such encoding schemes were developed from scratch or adapted from existing schemes in the 1960s, '70s, and '80s. These schemes typically reflected the specialized interests of their originators and were, by and large, incompatible; the end result was that a text encoded for one purpose or piece of software often required substantial editing to be used for another purpose or with other software, if it was reusable at all. Recognizing this, the humanities computing community attempted very early to launch efforts to develop encoding standards for computer-readable texts intended for scholarly research (San Diego 1977, Pisa 1980). However, these efforts failed to generate consensus on how, or even whether, such a standard should be developed, and thus they were aborted at the outset.

In November of 1987, the Association for Computers and the Humanities (ACH) convened a meeting at Vassar College in Poughkeepsie, New York, of over 30 representatives from archives, humanities computing centers, and professional organizations, to consider once again the standardization question.¹ This group agreed not only on the need for common practice but also on a set of basic principles to guide the development of guidelines for the encoding and exchange of literary and linguistic data, now commonly referred to as the "Poughkeepsie Principles":²

1. The guidelines are intended to provide a standard format for data interchange in humanities research.
2. The guidelines are also intended to suggest principles for the encoding of texts in the same format.
3. The guidelines should
   a. define a recommended syntax for the format
   b. define a metalanguage for the description of text encoding schemes
   c. describe the new format and representative existing schemes both in that metalanguage and in prose
4. The guidelines should propose sets of coding conventions suited for various applications.
5. The guidelines should include a minimal set of conventions for encoding new texts in the format.
6. The guidelines are to be drafted by committees on
   a. text documentation
   b. text representation
   c. text interpretation and analysis
   d. metalanguage definition and description of existing and proposed schemes
   coordinated by a steering committee of representatives of the principal sponsoring organizations.
7. Compatibility with existing standards will be maintained as far as possible.
8. A number of large text archives have agreed in principle to support the guidelines in their function as an interchange format. We encourage funding agencies to support development of tools to facilitate this interchange.
9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

The success of the Vassar conference had several sources. First, at the time of the conference more was known about encoding problems and basic principles were clearer than at the time of the earlier efforts already mentioned. Second, the Vassar group included a far more robust representation of key organizations and active research centers than had been gathered before. Third, the recently developed Standard Generalized Markup Language (SGML)³ provided a tool for developing a simple, flexible, and extensible encoding scheme capable of satisfying the widely varying needs of textual research. Finally, the consensus reflected the growing urgency of the need. At earlier meetings, it was predicted that if the humanities computing community did not adopt a common practice, chaos would ensue. At the Vassar meeting, no one needed to predict chaos; it was, as several speakers observed, the status quo.

Following the Vassar conference, the ACH was joined by the Association for Literary and Linguistic Computing and the Association for Computational Linguistics in driving the standards effort, thus forming the Text Encoding Initiative (TEI). The three organizations pledged to guide the effort and seek funding to support the TEI as an international, multi-lingual project to develop guidelines for the preparation and interchange of electronic texts for scholarly research.⁴

Very quickly, it was recognized that the TEI's goals served not only humanities scholarship, but were critical for a broad range of applications by the language industries more generally. It has become crucial for both research and industry to ensure that any text that is created can be used and, more importantly, reused for any number of applications, including applications which have not yet been imagined or developed. Thus, since its inception, the work of the TEI has achieved increasing importance for text-based work across disciplines and applications.

In May 1994, the TEI issued the first full version of its Guidelines for Electronic Text Encoding and Interchange.⁵ This report, which provides encoding conventions for a large range of text types and features relevant for research in language technology, the humanities, and computational linguistics, represents a major milestone: never before the TEI was it possible to achieve such broad consensus among the research community about encoding conventions.

In developing its Guidelines, the TEI identified the encoding needs for interchange and for the varied processing and analysis needs of the research community, laid out on this basis the encoding principles demanded for a general purpose scheme, and identified key text types and features for which encoding conventions needed to be developed. In most cases there were no pre-existing encoding conventions. In almost as many cases, there had not even been any usable prior analysis of the required categories and features and their relations for a given text type, in the light of real and potential processing and analytic needs. The TEI motivated and accomplished the substantial intellectual task of completing this analysis for a large number of text types and provided encoding conventions based upon it. The TEI's achievements include:

1. determination that the Standard Generalized Markup Language (SGML) is the appropriate framework for development of the Guidelines;