
Literary Detective Work on the Computer

293 Pages·2014·3.088 MB·English


Literary Detective Work on the Computer

Natural Language Processing (NLP)

The scope of NLP ranges from theoretical Computational Linguistics topics to highly practical Language Technology topics. The focus of the series is on new results in NLP and modern alternative theories and methodologies. For an overview of all books published in this series, please see http://benjamins.com/catalog/nlp

Editor
Ruslan Mitkov, University of Wolverhampton

Advisory Board
Sylviane Cardey, Institut universitaire de France, Université de Franche-Comté
Gloria Corpas, University of Malaga
Robert Dale, Macquarie University, Sydney
Eduard Hovy, University of Southern California
Alma Kharrat, Microsoft Research
Richard Kittredge, GoGenTex Inc, Ithaca
Lori Lamel, LIMSI, CNRS
Inderjeet Mani, Yahoo! Labs, Sunnyvale, USA
Carlos Martín-Vide, Rovira i Virgili Un., Tarragona
Rada Mihalcea, University of Michigan
Andrei Mikheev, Daxtra Technologies
Roberto Navigli, Universita di Sapienza, Roma
John Nerbonne, University of Groningen
Nicolas Nicolov, Microsoft Research
Kemal Oflazer, Carnegie Mellon University, Qatar
Constantin Orăsan, University of Wolverhampton
Manuel Palomar, University of Alicante
Khalil Simaan, University of Amsterdam
Richard Sproat, Google Research
Key-Yih Su, Behaviour Design Corp.
Benjamin Tsou, The Hong Kong Institute of Education
Yorick Wilks, Florida Institute of Human and Machine Cognition

Editorial Assistant
Miranda Chong, University of Wolverhampton

Volume 12
Literary Detective Work on the Computer
by Michael P. Oakes, University of Wolverhampton

John Benjamins Publishing Company
Amsterdam / Philadelphia

The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data
Oakes, Michael P.
Literary Detective Work on the Computer / Michael P. Oakes.
p. cm.
(Natural Language Processing, ISSN 1567-8202 ; v. 12)
Includes bibliographical references and index.
1. Computational linguistics--Research. 2. Imitation in literature. 3. Plagiarism. 4. Linguistics--Research--Methodology. 5. Authorship--Study and teaching. I. Title.
P98.5.O25 2014
006.3’5--dc23
2014007366
ISBN 978 90 272 4999 9 (Hb ; alk. paper)
ISBN 978 90 272 7013 9 (Eb)

© 2014 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA

Table of contents

Preface

Chapter 1. Author identification
1. Introduction
2. Feature selection
   2.1 Evaluation of feature sets for authorship attribution
3. Inter-textual distances
   3.1 Manhattan distance and Euclidean distance
   3.2 Labbé and Labbé’s measure
   3.3 Chi-squared distance
   3.4 The cosine similarity measure
   3.5 Kullback-Leibler Divergence (KLD)
   3.6 Burrows’ Delta
   3.7 Evaluation of feature-based measures for inter-textual distance
   3.8 Inter-textual distance by semantic similarity
   3.9 Stemmatology as a measure of inter-textual distance
4. Clustering techniques
   4.1 Introduction to factor analysis
   4.2 Matrix algebra
   4.3 Use of matrix algebra for PCA
   4.4 PCA case studies
   4.5 Correspondence analysis
5. Comparisons of classifiers
6. Other tasks related to authorship
   6.1 Stylochronometry
   6.2 Affect dictionaries and psychological profiling
   6.3 Evaluation of author profiling
7. Conclusion

Chapter 2. Plagiarism and spam filtering
1. Introduction
2. Plagiarism detection software
   2.1 Collusion and plagiarism, external and intrinsic
   2.2 Preprocessing of corpora and feature extraction
   2.3 Sequence comparison and exact match
   2.4 Source-suspicious document similarity measures
   2.5 Fingerprinting
   2.6 Language models
   2.7 Natural language processing
   2.8 Intrinsic plagiarism detection
   2.9 Plagiarism of program code
   2.10 Distance between translated and original text
   2.11 Direction of plagiarism
   2.12 The search engine-based approach used at PAN-13
   2.13 Case study 1: Hidden influences from printed sources in the Gaelic tales of Duncan and Neil MacDonald
   2.14 Case study 2: General George Pickett and related writings
   2.15 Evaluation methods
   2.16 Conclusion
3. Spam filters
   3.1 Content-based techniques
   3.2 Building a labeled corpus for training
   3.3 Exact matching techniques
   3.4 Rule-based methods
   3.5 Machine learning
   3.6 Unsupervised machine learning approaches
   3.7 Other spam-filtering problems
   3.8 Evaluation of spam filters
   3.9 Non-linguistic techniques
   3.10 Conclusion
4. Recommendations for further reading

Chapter 3. Computer studies of Shakespearean authorship
1. Introduction
2. Shakespeare, Wilkins and “Pericles”
   2.1 Correspondence analysis for “Pericles” and related texts
3. Shakespeare, Fletcher and “The Two Noble Kinsmen”
4. “King John”
5. “The Raigne of King Edward III”
   5.1 Neural networks in stylometry
   5.2 Cusum charts in stylometry
   5.3 Burrows’ Zeta and Iota
6. Hand D in “Sir Thomas More”
   6.1 Elliott, Valenza and the Earl of Oxford
   6.2 Elliott and Valenza: Hand D
   6.3 Bayesian approach to questions of Shakespearian authorship
   6.4 Bayesian analysis of Shakespeare’s second person pronouns
   6.5 Vocabulary differences, LDA and the authorship of Hand D
   6.6 Hand D: Conclusions
7. The three parts of “Henry VI”
8. “Timon of Athens”
9. “The Puritan” and “A Yorkshire Tragedy”
10. “Arden of Faversham”
11. Estimation of the extent of Shakespeare’s vocabulary and the authorship of the “Taylor” poem
12. The chronology of Shakespeare
13. Conclusion

Chapter 4. Stylometric analysis of religious texts
1. Introduction
   1.1 Overview of the New Testament by correspondence analysis
   1.2 Q
   1.3 Luke and Acts
   1.4 Recent approaches to New Testament stylometry
   1.5 The Pauline Epistles
   1.6 Hebrews
   1.7 The Signs Gospel
2. Stylometric analysis of the Book of Mormon
3. Stylometric studies of the Qu’ran
4. Conclusion

Chapter 5. Computers and decipherment
1. Introduction
   1.1 Differences between cryptography and decipherment
   1.2 Cryptological techniques for automatic language recognition
   1.3 Dictionary approaches to language recognition
   1.4 Sinkov’s test
   1.5 Index of coincidence
   1.6 The log-likelihood ratio
   1.7 The chi-squared test statistic
   1.8 Entropy of language
   1.9 Zipf’s Law and Heaps’ Law coefficients
   1.10 Modal token length
   1.11 Autocorrelation analysis
   1.12 Vowel identification
2. Rongorongo
   2.1 History of Rongorongo
   2.2 Characteristics of Rongorongo
   2.3 Obstacles to decipherment
   2.4 Encoding of Rongorongo symbols
   2.5 The “Mamari” lunar calendar
   2.6 Basic statistics of the Rongorongo corpus
   2.7 Alignment of the Rongorongo corpus
   2.8 A concordance for Rongorongo
   2.9 Collocations and collostructions
   2.10 Classification by genre
   2.11 Vocabulary richness
   2.12 Podzniakov’s approach to matching frequency curves
3. The Indus Valley texts
   3.1 Why decipherment of the Indus texts is difficult
   3.2 Are the Indus texts writing?
   3.3 Other evidence for the Indus Script being writing
   3.4 Determining the order of the Markov model
   3.5 Missing symbols
   3.6 Text segmentation and the log-likelihood measure
   3.7 Network analysis of the Indus Signs
4. Linear A
5. The Phaistos disk
6. Iron Age Pictish symbols
7. Mayan glyphs
8. Conclusion

References
Index

Preface

Computer stylometry is the computer analysis of writing style. This enables inferences to be made, especially about the sometimes disputed provenance of texts, but also about the dating of texts and also how texts reveal broad personality types. Following the PAN conferences, studies of disputed authorship, plagiarism and spam (unwarranted email campaigns) are considered together, partly because they often uncover fraudulent behaviour, but also because they are all examples of text classification: either a text is by author A or author B, or an email message is either spam or a legitimate message. The first two chapters will show many ways in which these techniques overlap, particularly regarding the question of how similar one text is to another. Chapters 3 and 4 aim to be comprehensive surveys of how computer stylometry has been used to examine the work of Shakespeare and the New Testament, both of great cultural significance.

This book takes the standpoint that all of Shakespeare’s most famous plays were indeed written by Shakespeare, so the focus is on the so-called Shakespeare “apocrypha” – plays for which there is some historical evidence to suggest that Shakespeare might have had a hand in their composition. Examples we will look at are the “Two Noble Kinsmen”, written with John Fletcher; “Pericles”, written with George Wilkins, and “Edward III”, possibly written with Kit Marlowe. We will see how computers have been able to indicate the extent of Shakespeare’s contribution in each case.
Computer stylometry has also considered the evidence for whether the handwritten fragment “Hand D” is by Shakespeare, as well as the play “Arden of Faversham”, where the association with Shakespeare is simply that the play was very popular in his day and is very good, but we do not know who wrote it. This book also considers evidence that the writing style of Shakespeare is distinct from that of a recently popular “claimant”, Edward de Vere.

Like Shakespeare, the King James version of the Bible has greatly influenced the English language. However, in the survey of computer stylometry given in Chapter 4, most of the studies considered have used texts in the original Greek of the New Testament. The findings largely agree with the beliefs of modern theologians, including a “cautious preference for Q”, a possible source of the Gospels of Matthew and Luke. The vast majority of computer studies on religious texts are concerned with the New Testament, but new work is also starting to emerge on the Book of Mormon and the Qu’ran.

In the final chapter, the aspect of literary detective work we consider is the decipherment of lost languages. In some respects, computer techniques can only scratch the surface of this, and there are difficulties in showing that a discovered script even constitutes language. However, the mathematics behind these techniques, and what these techniques do show us, are of considerable interest in themselves. The most extensive case studies in this chapter are the Rongorongo writings of Easter Island and the Indus Valley seals.

I would like to thank the series editor, Prof. Ruslan Mitkov, for suggesting in the first place a book centred around disputed authorship, plagiarism and spam. I am also grateful to Harry Erwin who kindled my interest in computer studies of religious texts, through his interest in the Signs Gospel, a possible precursor of the Gospel of John. I also wish to thank Miranda Chong, for her valuable feedback on Chapters 2 and 5. Finally, the following people were kind enough to send detailed responses to my emailed questions on their work: Raf Alvarado, Ward Elliott, Richard Forsyth, Antonius Linmans, David Mealand, Richard Sproat and Robert Valenza.

Michael P. Oakes
Wolverhampton, January 2014
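
The preface frames authorship attribution as text classification resting on the question of how similar one text is to another. The following is only a minimal sketch of that idea, not the book's own method: it assumes a small, hypothetical list of function words as features and uses the cosine similarity measure named in the table of contents. The names `FUNCTION_WORDS`, `profile` and `attribute` are illustrative inventions.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a few common English function words.
# A real study would choose features far more carefully.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "is")


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no features is treated as dissimilar
    return dot / (norm_u * norm_v)


def attribute(disputed, candidates):
    """Return the candidate name whose sample is most similar to the disputed text.

    `candidates` maps an author name to a sample of that author's writing.
    """
    target = profile(disputed)
    return max(
        candidates,
        key=lambda name: cosine_similarity(target, profile(candidates[name])),
    )
```

Calling `attribute(disputed_text, {"A": sample_a, "B": sample_b})` returns whichever label has the closer function-word profile. With samples this short the answer carries no evidential weight; the sketch only shows how "author A or author B" can reduce to comparing similarity scores.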
