Preface

In 1996, a research programme on translation and interpreting was jointly launched by the Language Divisions of the universities in Stockholm and Uppsala, with funding from the Bank of Sweden Tercentenary Foundation. One of the projects funded as part of this programme is the ETAP parallel corpus project in the Department of Linguistics at Uppsala University. Later, the department received funding from another quarter to initiate another parallel corpus project, the PLUG project, in cooperation with two other Swedish universities. Thus, we had two parallel corpus projects ongoing in our department, with partly differing and partly overlapping aims. We were also aware that there was quite a lot of work going on in Scandinavia and elsewhere on parallel corpora and their uses, and we felt that it would be a good idea to try to bring together parallel corpus researchers for an exchange of experiences and ideas. Hence, on 22–23 April, 1999, PKS99, a symposium devoted to all aspects of parallel and comparable corpora, took place at Uppsala University.

This volume contains edited versions of a selection of the symposium presentations. It starts with a general introduction to the papers and an overview of the field by Borin. The remaining papers cover a wide range of topics, grouped into four topical subsections: (1) general parallel and comparable corpus project presentations (Johansson, Sågvall Hein, Axelsson and Berglund); (2) discussions of specific linguistic applications of parallel and comparable corpora (Salkie, Trosterud, Geisler); (3) descriptions of the development and use of computational tools for parallel corpus linguistics (Grefenstette, Merkel et al., Stahl, Tiedemann) and (4) papers on parallel corpus annotation (Prütz, Borin).
The stated aims of the symposium were to assess the state of the art of parallel corpus research in general, and in Scandinavia in particular, as well as to bring together parallel corpus researchers for an exchange of experiences and ideas. These aims were amply attained in a number of ways, as this volume hopefully will bear witness to, but they would have come to nothing without those people from all over who attended the symposium, or those in the Department of Linguistics, Uppsala University, who worked behind the scenes to make everything run smoothly. My heartfelt thanks go to you, to Uppsala University’s Faculty of Languages and to the Bank of Sweden Tercentenary Foundation, in the guise of the research programme Translation and Interpreting—a Meeting between Languages and Cultures, for the funding which made the symposium possible, and to the Language and Computers series editor, Jan Aarts, who immediately invited me to his home in Nijmegen to discuss the manuscript with him when I happened to be in the vicinity.

Lars Borin

… and never the twain shall meet?

Lars Borin
Department of Linguistics, Uppsala University

Abstract

Parallel and comparable corpora are playing an increasingly important role in linguistics and computational linguistics. This introduction aims at providing an overview of the state of the art of parallel and comparable corpus research, paying particular attention to the situation in Scandinavia. The existence of two distinct and partly separate research traditions in parallel corpus linguistics is noted and discussed, as are the issues of parallel corpus terminology, the creation of parallel and comparable corpora, and the uses and potential uses of parallel and comparable corpora in linguistics and computational linguistics. Finally, we look at the development of tools for the creation and processing of such corpora.

1. Introduction

In the last decade or so, parallel corpus linguistics has emerged as a distinct field of research within corpus linguistics, itself a fairly young discipline. Work in parallel corpus linguistics is of course normally presented in the same forums as general corpus linguistics research, but there is also an increasing number of scientific meetings devoted solely or mainly to parallel and comparable corpora, or to more specific research issues in connection with the creation, annotation and processing of such corpora.

In parallel corpus linguistics—as the name implies—the corpus which is the object of research consists of texts which are ‘parallel’ in some sense (to be defined in more detail in the next section). However, a very common kind of parallel corpus—the prototypical kind, one is tempted to add—is that which consists of original texts in one language, together with their translations into another language. The obvious and immediate uses for such a corpus, in conjunction with the parallel corpus counterparts of the processing tools—concordancers, phrase extractors, statistics packages, etc.—which have been found so useful in dealing with non-parallel corpora, are in (machine) translation research, including the development of example-based machine translation systems, language teaching and the teaching of translation, bilingual lexicography and contrastive and typological linguistics. The literature is full of examples of the use of parallel corpora for research and development in these areas (see, e.g., Ahrenberg et al. 1998a; Ebeling 1998a; Johansson 1998, this volume; Botley et al. 2000; Sågvall Hein this volume; Salkie this volume).
However, from the work done on parallel corpora in our department at Uppsala University, from the papers in the present collection, and from other recent research reported in the literature, it is clear that there are other uses—not equally obvious, but nonetheless at least as important as those just mentioned—for parallel corpora and parallel corpus processing tools. We will return to those uses in section 5, below.

The purpose of this introductory chapter is to present an overview of the state of the art of parallel corpus research, particularly the situation in Scandinavia, and also to provide a fairly comprehensive and up-to-date bibliography for this field. It is organised in the following manner. Since the field has not yet settled on an unambiguous terminology, the next section is devoted to sorting out the various uses of the terms most frequently encountered in the literature on parallel corpus linguistics. In section 3, I portray the two rather different kinds of research which are pursued under the heading of parallel corpus linguistics, and try to clarify what their differences are and what they have in common. Sections 4, 5 and 6 paint a picture of the current state of the art in the creation, use, and processing of parallel corpora, respectively, and I also try to show how the research presented in the contributions in this volume fits into this picture. Finally, in the concluding section, I try to see where we are headed, although it is risky to attempt to forecast even the near future of such a quickly changing field as parallel corpus linguistics.

2. Terminology: parallel and comparable corpora

Even though there is no great wealth of terms specific to the field of parallel corpus linguistics, it seems that few, if any, of the terms used have unambiguous denotations for all parallel corpus researchers. For the corpora themselves, at least the following terms are found in the literature, with the meanings listed under each term.
(1a) parallel corpus (parallel texts)
    ‘collection of translationally related texts’
    “two (or more) subcorpora which exhibit some kind of parallelism” (Ebeling 1998a)
    “collection of functionally similar (original) texts (in two or more languages)” (Hartmann 1997, cited by Peters and Picchi 1997)
    “texts originally written in language A and their translations into languages B (and C ...)” (Teubert 1996: 245, cited by Lawson 2001: 284)
    “an equal amount of texts originally written in language A and B and their respective translations” (Teubert 1996: 245, cited by Lawson 2001: 284)
    “only translations of texts into the language A, B and C, whereas the texts were originally written in language Z” (Teubert 1996: 245, cited by Lawson 2001: 284)
(1b) comparable corpus (comparable texts)
    ‘collection of functionally similar (original) texts (in two or more languages)’
    “collection of original and translated texts in one language” (Baker 1995, cited by Ebeling 1998a)
(1c) multilingual corpus
    “collection of functionally similar (original) texts (in two or more languages)” (Baker 1995, cited by Ebeling 1998a)
(1d) translation corpus
    ‘collection of translationally related texts’
(1e) bitext (or bi-text)
    ‘translationally related texts in two languages’ (“translation mates” in Yang et al. 1998)
(1f) noisy parallel corpus
    ‘collection of translationally related texts, but with gaps, i.e. there are source or target language segments missing’
(1g) reciprocal corpus
    “an equal amount of texts originally written in language A and B and their respective translations” (Teubert 1996: 245, cited by Lawson 2001: 284)

Merkel (1999: 11) offers a more detailed taxonomy of those kinds of corpora which may be considered to fall in the general category of parallel corpus (all quotations in 2a–2h are from Merkel 1999: 11):

(2a) diachronic corpus
    “For example Chaucer’s Canterbury Tales in Medieval English vs. modern English versions”
(2b) transcription corpus
    “Transcriptions of dialect versions of a standard language text or phonetic transcription of spoken language”
(2c) target variant corpus
    “Different translations into the same target language of the same original text”
(2d) translation corpus
    “Source text and target text”
(2e) multi-target corpus
    “Several target texts in different languages originating from one source text”
(2f) mixed source corpus
    “Several parallel texts where the original is unknown”
(2g) text type corpus
    “Multilingual corpus containing texts from the same text genre”
(2h) mixed text type corpus
    “Different text types in several languages, usually balanced”

Texts—and here we also include spoken language production—can, in principle, differ from each other in many ways, and in many dimensions. We could say that the defining difference which singles out parallel corpus linguistics from other kinds of corpus linguistics is one of language; the texts in a parallel corpus are not all in the same language (variety).1 Thus, by a parallel corpus we understand a corpus in which there are texts in one language, or language variety, together with corresponding texts in another language, or language variety, where the relationship between the two sets of texts is one of translation equivalence, in a broad sense, or put in another way, that there is a (direct or indirect) translation relation between the texts. This covers a broad range of relationships between the language varieties involved, such as: an older and a more modern form of the ‘same’ language (2a); a spoken and a written variety of the same language (2b); a translation in the conventional sense (2d); two different translations of the same source text (2c); etc. (Numbers and letters in parentheses refer to Merkel’s type, according to the list given above).

A parallel corpus need not be restricted to only two languages or language varieties.
If there are more than two languages, their mutual relationships may, again, be of various kinds, covered primarily by Merkel’s types 2d, 2e and 2f. For instance, the text which is perhaps the most widely used in parallel corpus experiments because of its ready availability in a number of language versions, namely the Bible, is almost always used as a parallel corpus of type 2e, i.e. the original is normally missing from the corpus (e.g. Melamed 1998a, 1998b; Resnik et al. 1999; Borin 2000a). The other types may also, strictly speaking, involve more than two varieties, so that, e.g., the three versions of the 9th century Strasbourg Oaths—the normalised original version, together with translations into classical Latin and modern French—discussed by Calvet (1998: 102) would constitute a multilingual parallel text of type 2a.

Much has been made of the fact that the target language texts in translation corpora tend to be influenced by the source language, i.e. that they exhibit what has been called ‘translationese’ (see, e.g., Gellerstam 1985; Santos 1995a; Ebeling 1998a; Malmkjær 1998; Peters and Picchi 1998; Johansson this volume). Translationese has been characterised as “deviance in translated texts induced by the source language” (Johansson and Hofland 1994: 26). The kind of deviance referred to here is not to be equated with errors in the normal sense, however. Rather, it should reveal itself in ‘odd’ choices of lexical items and syntactic constructions, which conceivably could be the result of both assimilation and dissimilation with respect to the source language or the source text. It is certainly true that so-called translation effects occur, and even perhaps that they are common, which is clearly demonstrated by Johansson in this volume (see also Santos 1995b; Borin and Prütz 2001). However, it does not follow from this fact alone that the target language texts are bad or even unrepresentative examples of the language.
One cannot but agree with Johansson’s conclusion that it is not a simple matter to decide on the status of such texts from a normative point of view (cf. Mauranen 1997; Lawson 2001). It is natural, in any case, that some researchers have instead turned to other ways of using multilingual corpora for studies in translation and contrastive lexicology.

We may lift the restriction that there should be a translation relation between the two (or several) text sets in the corpus, and instead only require that they be equivalent in some other way, for instance subject or style (e.g., food recipes in different languages), or simply that they be representative of the general (written or spoken) language, i.e. so-called balanced corpora. In this case we will talk about a comparable corpus. Merkel’s types 2g and 2h belong in this category. Since all the texts in a comparable corpus can be original texts, there should be no translation effect in them, and the equivalences found should be ‘truer’ to the spirit of each language. Another factor which speaks in favour of working with comparable corpora, rather than with parallel corpora, is the relative scarcity of parallel text material, which is hard to come by, compared to monolingual corpus material in various languages (see, e.g., McEnery 1997). Official documents issued in bi- and multilingual countries and by international organizations, as well as technical documents from multinational companies, are thus well represented—over-represented, even—among existing parallel corpora, while some text types or genres may be missing entirely or only exist in one direction (Trosterud this volume).2 We must remember, however, that the existing computational methods for working with comparable corpora are, on the whole, less developed and less effective than those for parallel corpora (see section 6, below).

We have seen that there is some variety—even inconsistency—in the terminology in this field.
Keeping this in mind, we will from now on avoid the more elaborate classifications mentioned above. Instead, we will simply make a general distinction between parallel and comparable corpora, in the senses just introduced, when such a distinction is called for, and in some instances, when the distinction is of less importance, we will use parallel corpus as a general term for both kinds of corpus.

3. Each according to its kind

Even though corpus linguistics is a comparatively new field, it has already split into two distinct and partly separate research traditions, and this partitioning of the field has continued in parallel corpus linguistics. I would like to state for the record, as it were, that I do not think that this reflects any kind of ideological division of the research community. It is, rather, merely the effect of researchers coming from different research traditions developing an interest in multilingual text resources for needs internal to those traditions and thus using the methods and tools most familiar and comfortable in each tradition.

3.1 Plain and computational (parallel) corpus linguistics

The first—and older—tradition is that which in Scandinavia at least tends to be located in university language departments (often English departments), and in which the emphasis is on the construction and use of parallel corpora for the investigation of linguistic phenomena for such purposes as traditional lexicography, second and foreign language pedagogy, or grammatical description for human consumption. This can be seen as simply a continuation of a traditional preoccupation with textual studies among students of language. In modern corpus linguistics, one of the two languages tends to be English in this tradition (cf. Trosterud this volume).
Broadly speaking, the computational tools used in this tradition are concordancers, sentence and paragraph alignment tools, and co-occurrence statistics of various kinds, and quite a bit of human ‘post-editing’ is seen as a normal part of the process of utilising parallel corpora. In other words, corpora (whether parallel or not) are seen as useful sources of empirical language data, which are investigated in a venerable linguistic methodological framework. You can get a good sense of the kind of work that is being done in this tradition if you look through the volumes edited by Johansson and Oksefjell (1998) and Aijmer et al. (1996).

The other tradition has emerged more recently in computational linguistics, partly as an effect of a reawakened interest in probabilistic methods in this field. Gazdar (1996), following Thompson (1983), divides computational linguistics (or Natural Language Processing, NLP) into three main subareas, (i) theory of linguistic computation, (ii) computational psycholinguistics, and (iii) applied NLP. It is within the first of these that (parallel) corpus linguistics is pursued (even though there are often practical applications—i.e., belonging in the third subarea—in the minds of the researchers working in this field). Theory of linguistic computation involves

    the study of the computational, mathematical and statistical properties of natural languages and systems for processing natural languages. It includes the development of algorithms for parsing, generation, and acquisition of linguistic knowledge; the investigation of the time and space complexity of such algorithms; the design of computationally useful formal languages (such as grammar and lexicon formalisms) for encoding linguistic knowledge; the investigation of appropriate software architectures for various NLP tasks; and consideration of the types of non-linguistic knowledge that impinge on NLP.
    (Gazdar 1996: 2)

Let us call these two types of corpus linguistics simply corpus linguistics, referring to the (plain) general linguistic type, and computational corpus linguistics, for the type pursued in computational linguistics.

[Figure 1: Corpus linguistics. Labels in the diagram: computational linguistics, corpus linguistics, computational corpus linguistics, linguistics, artificial intelligence, applied linguistics, computer science, statistics, mathematical statistics.]

Like most other fields of scientific inquiry, corpus linguistics owes some of its conceptual apparatus, methods, etc. to other disciplines and their development, and sometimes influences these disciplines in return. In Figure 1, an attempt is made to graphically show the relationships between corpus linguistics and computational corpus linguistics (together creating the oval in Figure 1, the dividing line between the two being defined by the boundary of the computational linguistics rectangle), on the one hand, and between these and disciplines to which they are connected by mutual influences, on the other.3 The figure is meant to illustrate corpus linguistics in a narrow sense. Thus, we disregard the use of text corpora for research in literature (or the culture, history, etc. of a language community); also, the label (applied) linguistics covers both general linguistics and the study of individual languages. Depending on various factors, such as the particular individual language and the local academic tradition, these two kinds of linguistics can be fairly alike or very different. Mathematical statistics (including probability and information theory) and (applied) statistics have been kept apart. While the former is very much a part of computational corpus linguistics, it is more rarely met with in linguistics.4 On the other hand, applied statistics is extensively used in linguistics for various ends, particularly by corpus linguists and sociolinguists.
3.2 The knowledge-acquisition problem in NLP

In computational linguistics, one often hears about the “knowledge-acquisition bottleneck”. This phrase reflects the belief (held by many computational linguists) that it is unrealistic to assume that it would be possible to undertake (a) the manual encoding of the vast amounts of linguistic knowledge (especially lexical knowledge) needed in any serious applied NLP system, knowledge which furthermore (b) may not even be available for human consumption (in grammars and dictionaries) at the level of detail necessary for a computer application (especially if we consider other languages than the world’s major languages). As a sign of the times, at the 1988 COLING conference in Budapest, there was a panel devoted to a discussion of the perceived mismatch between the issues considered interesting by linguists, and those which needed to be dealt with for broad-coverage NLP to become a reality (Nagao 1988). The kind of sentiment expressed by several (but not all) participants in this panel has been very aptly described in another context:

    Children learn to swim in the water, not in a classroom. One could even get a Ph.D. in swimming and write a book about it, then jump in the water and drown. Anybody who has had four years of high school French and then gone to Paris has probably had a similar experience. The academic approach has its own value, but it does not, by itself, produce a vital living language. (Krauss 1996: 21)

Nor does the academic approach lead to directly usable NLP applications, one might add. Many computational (parallel) corpus linguists have turned to automated knowledge-acquisition from corpora as a possible way out of this quandary.
Consequently, methods (both probabilistic and symbolic) are borrowed from the branch of artificial intelligence known as machine learning (Mitchell 1997), but new methods are also being developed to cope with the specific learning situations encountered in the natural language domain. This research activity is sometimes referred to as ‘empirical NLP’, and it aims at the automatic acquisition of linguistic knowledge from corpora, including parallel corpora, for use in NLP systems, such as machine translation systems, parsers, multilingual IR (Information Retrieval) systems, including Web search engines, and only secondarily (if at all) for such purposes as in ‘linguistic’ corpus linguistics.

This development has certainly been prompted not only by the inadequacy of academic linguistics for attaining the goals pursued by the NLP community (which in my view at times has been exaggerated, possibly for rhetorical reasons). Certainly the fact that many researchers in computational linguistics have a computer science background, rather than one in linguistics, has something to do with it. However, the different aims of the computational linguistics and natural language processing communities as compared to those characteristic of theoretical linguistics have been equally important.

The emphasis in computational corpus linguistics on large-scale, automatic, and fairly knowledge-poor, sometimes unsupervised, methods has engendered—almost by necessity—a lively interest in such issues as method evaluation and the construction of general formalisms and standardised tools for storing, searching and processing (very large) corpora, and in the combination of the simpler corpus-processing tools mentioned earlier with more ‘intelligent’ NLP technology such as shallow parsers. Of course, English looms large here, as well, but there is also quite an amount of work being done on many other languages and language pairs where neither language is English.
This kind of ‘multilingual empirical NLP’ is a comparative newcomer on the (parallel) corpus scene, for two reasons. Firstly, computer technology is only now developing to the point where a common desktop computer is able to routinely carry out the large-scale number-crunching needed for sophisticated statistical NLP, which was previously beyond the reach of all but a few supercomputer users (see Manning and Schütze 1999). Secondly, the explosive development of the world-wide web (and