PALM Plateforme d’analyse linguistique médiévale Version 0.1 User Manual February 2013

PALM Plateforme d’analyse linguistique médiévale Version 0.1 User Manual November 2013

Contents

1. Why PALM?
1.1 What does PALM do?
1.2 Why would you want to do this?
1.3 How does PALM lemmatise?
1.4 Why lemmatise?
1.5 What texts can be treated by PALM?
1.6 What is MEDITEXT?
1.7 How do you log in to PALM?
2. MEDITEXT: PALM’s internal Library
2.1 Browsing the Library
2.2 Finding out more about a text from the Library
2.2.1 Title
2.2.2 Lemmatised?
2.2.3 Period
2.2.4 Access level
2.2.5 Main language
2.2.6 Country of Origin
2.2.7 Author
2.2.8 Edition
2.2.9 Digitised by
2.2.10 Date
2.2.11 Notes
2.2.12 Add the text
2.3 Viewing a Library Text
2.4 Adding a Library text to your Workspace
3. Managing Your Workspace
3.1 Adding Texts to Your Workspace
3.2 Uploading a New Text
3.2.1 Verse/prose
3.2.2 ‘Field’/‘champ’ and ‘Text type’/‘type de texte’
3.2.3 Title
3.2.4 Main Language
3.2.5 Country of Origin
3.2.6 Period
3.2.7 Access level
3.2.8 Text type
3.2.9 Author
3.2.10 Edition
3.2.11 Language
3.2.12 Digitised by
3.2.13 Date
3.2.14 Notes
3.2.15 Text
3.3 Managing Texts in your Workspace
3.3.2 Text details
3.3.3 View the Text
3.3.4 Modify the Text
3.3.5 Delete the Text
3.3.6 Add to the Library
4. Lemmatising a Text
4.1 The Morphosyntactic Tagging Page
4.2 The Annotator
4.3 Correcting a text in the Annotator
4.4 Correcting an annotation
4.5 Correcting all the instances of a form
4.6 Definition of lemma within PALM
4.6.1 Latin Lemma
4.6.2 Middle French Lemma
4.6.3 Middle English Lemma
4.6.4 Note on word division
4.7 Definition of Part of Speech within PALM
4.7.1 Parts of speech in Latin
4.7.2 Parts of speech in Middle French
4.7.3 Parts of speech in Middle English
4.7.4 Note on ‘Named Entities’ and Proper Nouns
4.8 Navigating through the text in the Annotator
4.9 Annotating a Corpus
5. Export
5.1 Exporting a Corpus
5.2 Note on Export Formats
5.2.1 TXM
6. Managing Your Account
6.1 Changing User Details
6.2 Changing Your Password
7. Administering PALM and MEDITEXT
7.1 Add a new user
7.2 User Account Management
7.3 Manage the Library
8. Digital linguistic resources provided by PALM
8.1 Electronic Lemma-Form Dictionaries
8.2 Taggers
8.3 ‘Rules’
8.4 Development and Performance of PALM: Latin
8.5 Development and Performance of PALM: Middle French
8.6 Development and Performance of PALM: Middle English
8.7 Further technical remarks on the operation of PALM
9. PALM-MEDITEXT: List of texts

1. Why PALM?

1.1 What does PALM do?

PALM is an online platform which pre-treats medieval texts so that they can be analysed using software designed for the statistical and semantic analysis of texts in modern languages (often called ‘textométrie’).

Although PALM includes a digitised library of medieval texts called MEDITEXT (see below, 1.6 ‘What is MEDITEXT?’), this is provided to enable the user to compile text corpora for statistical and semantic analysis. It emphatically does not offer online editions of these texts, many of which are in a rough digitised form. Users who need a text of scholarly quality, for example for purposes of citation, should consult the most recent available edition.

Specifically, PALM provides facilities for the computer-aided annotation of text corpora by lemma (that is, the standard form of a word as it appears in a dictionary) and by part of speech. PALM has been developed for use on texts in late medieval Latin, French and English of northern French and English origin, but its architecture has been designed to permit the annotation and the development of resources for texts in any medieval language.

Texts can be uploaded to PALM with little or no mark-up, or from minimally prepared XML-TEI or Word files.
PALM can give output in a number of formats adapted for use with such software packages as Hyperbase, Lexico 3, Tramer, Analyse and TXM.

1.2 Why would you want to do this?

PALM’s intended users are historians, literary scholars or philosophers who would like to make use of widely available computer tools for the statistical and semantic analysis of their late medieval text corpora but have been prevented from doing so by the absence of standard spelling and by the presence of non-standard vocabulary items in their texts.

For modern languages, many digital tools exist to assist the researcher in tasks ranging from simple lexical tracking, for example through concordances, to the application of statistical tools, from the identification of collocations and co-occurrences to sophisticated statistical methods such as factorial analysis.

Without PALM, a researcher who wishes to apply these tools to late medieval texts must first lemmatise his or her corpus manually, grouping together both variant spellings and all inflected forms. PALM greatly eases the task of lemmatisation, making it as automatic as possible, but also providing facilities for the manual correction which is inevitably necessary for texts in these three languages.

1.3 How does PALM lemmatise?

PALM lemmatises...

(1) by the application of the linguistic resources it contains: digital form-lemma dictionaries; ‘taggers’ trained on annotated corpora; ‘rules’ programmed manually for each language.
(2) by providing a user-friendly environment in which the user can correct this annotation and so create new linguistic resources.

Text corpora annotated in PALM can then be exported in a number of formats which can be used by widely available text-analysis software designed for standardised modern languages, such as TXM, Tramer, Analyse, Lexico 3 and Hyperbase.

For a technical discussion of how PALM lemmatises, see below, section 8, ‘Digital linguistic resources provided by PALM’.

1.4 Why lemmatise?
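As a rough illustration of what grouping forms under a lemma means in practice, the Python sketch below counts occurrences by lemma using a tiny form-lemma dictionary. The dictionary entries, the sample tokens and the function name `lemma_counts` are invented for this example and are not taken from PALM’s own resources; PALM itself also relies on taggers and rules, which are not modelled here.

```python
from collections import Counter

# Hypothetical form-lemma dictionary (illustrative entries only, not PALM data):
# every variant spelling or inflected form points to a single lemma.
FORM_TO_LEMMA = {
    "roy": "roi", "roi": "roi", "roys": "roi", "rois": "roi",
    "kyng": "king", "kynge": "king", "king": "king",
}


def lemma_counts(tokens, form_to_lemma):
    """Count tokens by lemma.

    Forms that are missing from the dictionary are counted under their own
    lower-cased spelling, so nothing is silently dropped.
    """
    return Counter(form_to_lemma.get(t.lower(), t.lower()) for t in tokens)


tokens = ["Roy", "roys", "kyng", "kynge", "rois", "parlement"]
counts = lemma_counts(tokens, FORM_TO_LEMMA)
print(counts["roi"], counts["king"], counts["parlement"])
```

Without the dictionary, the six sample tokens would be treated as six different words; with it, ‘Roy’, ‘roys’ and ‘rois’ are counted together under one lemma.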
Lemmatisation is useful even for texts in modern languages. It makes it possible to perform statistical analyses and to follow usage of all the inflections of a verb, for example, something which can be very important in inflected languages such as French.

Lemmatisation is even more important in treating medieval vernaculars, because of the absence of standard spelling in these languages. Lemmatisation makes it possible to group together all the variant spellings of a particular lemma, and so perform statistical analyses and follow usage in a way which would be impossible without it.

Even in medieval Latin, where spelling variation is less marked, there are nonetheless a large number of words, often imported from a vernacular language, which also vary in spelling, particularly in practical contexts close to contemporary legal or economic practice.

1.5 What texts can be treated by PALM?

Users can either import their own texts into PALM or make use of PALM’s internal library of texts: MEDITEXT.

1.6 What is MEDITEXT?

MEDITEXT is a corpus of texts first assembled under the direction of Jean-Philippe Genet and Claude Gauvard between the 1970s and 2010. It was corrected and expanded as part of the European Research Council project ‘Signs and States’ between 2010 and 2014. It provides the basis for PALM’s internal Library.

MEDITEXT, and consequently PALM’s internal Library, contains essentially ‘political’ texts, by which we mean: texts which are associated with identified political events (speeches, letters, treatises, poems, sermons, chronicles); texts which deal in general with good or bad government; and a variety of texts addressed by the king to his subjects or by his subjects to the king (proclamations, acts, cahiers de doléances, petitions, lettres de rémission).

For the moment, PALM’s internal Library contains texts of English origin (in Middle English, Middle French and Latin) and of (northern) French origin (in Middle French and Latin).
We would however like to include texts of different provenance in the future.

1.7 How do you log in to PALM?

PALM is accessed via the internet at the address <https://palm.tge-adonis.fr/PALM>. In order to log in to PALM you will first need to apply for a username and password. You can do this by sending an email to [email protected] or by using the ‘contact’ form on the website.

2. MEDITEXT: PALM’s internal Library

When you first log on to PALM, you are presented with a short description of its internal Library ‘MEDITEXT’. If you want to return to this description, click on ‘Library’ and ‘Presentation’.

To access MEDITEXT, go to the menu ‘Library’ and click on ‘Consult the Library’.

Note, however, that PALM does not aim to provide digital editions of the texts it contains. Its aim is to permit the user to create a corpus of texts, which can then be pre-treated, before being exported for treatment by software designed for use with modern languages.

2.1 Browsing the Library

There are over 900 text files in PALM’s Library. For a complete list of texts, see annex 2 to this manual.

You can browse by short title, by language, by country of origin or by ‘period’ (divided into half centuries). If you already know the file code of a particular text in PALM, you can browse using ‘code’.

We also intend to provide a search engine to explore the Library, but this option is not yet active.

2.2 Finding out more about a text from the Library

To find out more about a text, right click on it, and select ‘Details’. The ‘Text Details’ screen then appears, providing basic information about the text.

2.2.1 Short Title

The first field is a short title in a standard format, to make the text easy to identify. Note that this is not the ‘title’ of a document in a strict sense, but rather a short name (including the author’s name) which allows a text to be located quickly in the Library. For a more precise scholarly identification of the text, see the field ‘Edition’ below.
The default language for a short title is French, except when the text is widely known under a name in a different language. If this title is an editorial convention only, for texts in Latin and French, a French translation is suggested in brackets. For authors widely known in France, names are given in French. Alternative names of the author in English or Latin are supplied where appropriate in the field ‘Author’ below.

Note that long texts will be split up amongst several shorter files. Where possible, this follows the editorial or authorial subdivisions of the text, but sometimes the division is necessarily arbitrary.

Typical short titles include:

Magna Carta
Gille de Rome, De Regimine Principum, pt. II, bk. 2
Ranulph Higden, Polychronicon, vol. viii, p. 50-100.
Against the King’s Taxes
Acceptation par Richard d’York du titre de Protecteur, 17 nov. 1455

More details about this standard form are given later in the manual under ‘Uploading a New Text: Short Title’ (section 3.2.2).

2.2.2 Lemmatised?

This field marks whether the text in the Library has already been lemmatised or not.

2.2.3 Period

The period field provides an approximate dating of a text, to make it easy for the user to find texts of around the same date. Each text is assigned to a half-century period. Where the period of composition is only known approximately, or when the composition took place over a number of years, the most probable or most significant period is selected or, if this is not known, the earliest relevant period. So, if you are looking for texts from a certain range of years, it would be wise also to search the periods just before and just after the one you are looking for.

Note that no extensive verification has been undertaken for the dating of texts. Unless otherwise noted, the date used in the edition has been accepted.
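The advice above about searching neighbouring periods can be sketched in Python as follows. This is only an illustration: it assumes that half-centuries run from years ending in 00 to 49 and from 50 to 99, which the manual does not state, and the function names are hypothetical rather than part of PALM.

```python
def half_century(year):
    """Return the (start, end) years of the half-century containing `year`.

    Assumption: periods are 1400-1449, 1450-1499, and so on.
    """
    start = (year // 50) * 50
    return (start, start + 49)


def periods_to_search(year_from, year_to):
    """List the half-century periods covering a range of years,
    padded with one extra period before and one after."""
    first = half_century(year_from)[0] - 50
    last = half_century(year_to)[0] + 50
    return [(start, start + 49) for start in range(first, last + 1, 50)]


# A user interested in texts from 1455 to 1460 would look at three periods.
wanted = periods_to_search(1455, 1460)
print(wanted)  # [(1400, 1449), (1450, 1499), (1500, 1549)]
```

Padding the range in this way allows for texts whose assigned period is only the most probable one, or the earliest of several possible periods.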
2.2.4 Access level

PALM contains texts of three levels of access:

(3) ‘User’, which can be seen and used by anybody;
(2) ‘Expert’, which can only be seen by advanced users of PALM; and
(1) ‘Administrator’, which can only be seen by the system administrator.

From the point of view of an ordinary user, only level 3 texts will appear in the Library. You can however set the access level of your own texts to ‘Expert’ or ‘Administrator’ to restrict access by other users of the system (see below, section 3.2, ‘Uploading a New Text’).

2.2.5 Main language

This is the language in which the majority of the text is written. Medieval texts often contain phrases, sentences or entire passages in different languages, and in extreme cases can be written in several languages at once.

2.2.6 Country of Origin

This field gives a marker of origin, in French, as far as possible appropriate to the period of composition: for example ‘France’ and ‘Angleterre’ for the kingdoms of the later Middle Ages where most of the texts in the Library were written.

2.2.7 Author

This field identifies alternative versions of an author’s name, especially where the author is known in several languages, as well as any second and third authors.

2.2.8 Edition

This field is designed to enable the user to identify and locate the edition or other source used for the digitisation, including manuscripts. French citation standards are followed, although the title and the author’s name are cited in the same language as in the edition. Normally only the place of publication is given, except where that could help to identify the precise edition. For examples of PALM’s citation style, see below, section 3.2.10 (‘Uploading a New Text: Edition’).

2.2.9 Digitised by

The name of the person or people who digitised, corrected and uploaded the text onto PALM.

2.2.10 Date

If the text is dated, the date will be marked. For the form used, see below, section 3.2.13.

2.2.11 Notes