The Computation of Style

AN INTRODUCTION TO STATISTICS FOR STUDENTS OF LITERATURE AND HUMANITIES

by ANTHONY KENNY
Master of Balliol College, Oxford

PERGAMON PRESS
OXFORD · NEW YORK · TORONTO · SYDNEY · PARIS · FRANKFURT

Copyright © 1982 Anthony Kenny. All Rights Reserved.

Title. II. Series. PN73.K46 1982 807.2 81-19221 AACR2

British Library Cataloguing in Publication Data
Kenny, Anthony
The computation of style.
1. English literature--Style 2. Electronic digital computers
I. Title
820.9 PR57
ISBN 0-08-024282-0 (Hardcover)
ISBN 0-08-024281-2 (Flexicover)

Printed and bound in Great Britain by A. Wheaton & Co., Exeter

Preface

THIS book is intended as an elementary introduction to statistics for those who wish to make use of statistical techniques in the study of literature. There are many elementary textbooks of statistics on the market, but they are all aimed at students in the natural or social sciences. In general they are unsuitable for scholars in the humanities because they presuppose too great a facility in mathematics and because they fail to emphasize or illustrate the application of statistical methods to literary material. To the best of my knowledge there exists no elementary statistical text in English written with the needs and interests of the literary student specifically in mind.

The reader of the present book is not assumed to have any mathematical competence beyond a rusty memory of junior school arithmetic and algebra. The illustrations in the text have been carefully chosen to make the calculations involved as simple as possible. The exercises have been carefully selected so that the calculation of the relevant statistics will be possible for anyone who can remember how to do long division. (They will not involve, for instance, the extraction of difficult square roots.) All these problems can be solved without the aid of a slide-rule, pocket calculator, or computer. To facilitate the solutions in this way the literary texts have occasionally been slightly doctored, and the statistics obtained are not usually of any literary interest except as illustrations of technique. Had the examples been less artificially chosen, the calculation would have been cumbrous without an aid such as a calculator, but the results would have been of greater interest from a literary point of view.

Any reader who proposes to become a serious student of literary statistics will have to acquire, sooner or later, a suitable calculator. However, the book is so designed that no one should need a calculator to follow the argument of the text or work out the exercises.

Calculators are now available at reasonable prices which not only facilitate the computation of statistics but also dispense with the consultation of statistical tables. For this reason the number of tables included in this volume has been kept down to the minimum necessary to introduce the various techniques involved to the reader who has not yet acquired a calculator.

The present book covers approximately the same ground as that in any first-year statistics textbook: it differs from other books in that it places emphasis on those techniques which are most useful in literary contexts and in the choice of examples drawn almost entirely from literary and linguistic material.

Like the readers for whom this book is intended, I came to statistics as a mathematical ignoramus with a purely humanistic background. I therefore owe a great deal to the textbooks of other writers from which I have learnt. In acknowledging my debt to them, I shall also explain that they still seemed to leave room for a textbook aimed at a specifically humanistic audience.

The three textbooks I found most helpful were Donald Ary and Lucy Cheser Jacobs, Introduction to Statistics: Purposes and Procedures; Evelyn Caulcott, Significance Tests; George W. Snedecor and William G. Cochran, Statistical Methods. The first is the most clearly written and easily intelligible introduction to descriptive statistics I have encountered; the second gives the fullest coverage, at an elementary level, to the branch of inferential statistics which is most useful to the literary scholar. Both books, however, are aimed at the social scientist and draw their examples from sociology and economics and related fields. The third is considerably more advanced than the first two and progresses much further than the present volume. It contains much information that is useful to the literary statistician, but it is difficult going for someone without a mathematical background, and it is particularly oriented towards the agricultural, medical and biological sciences.

Charles Müller's Initiation aux Méthodes de la Statistique Linguistique is aimed at very much the same audience as the present book. However, there were a number of reasons why it seemed preferable to write a new book rather than to translate Müller for an English audience. In the first place, Müller's examples, naturally enough, are drawn mainly from French literature and in particular from his own work on Corneille. In the second place, the book lacks exercises and would be difficult to use without the separately published Exercises d'Application as an elementary textbook.
But any reader of the present book will be able to read Müller with pleasure and profit.

The present book, while introducing the reader to the use of statistical techniques in literary context, does not attempt any general evaluation of the merits of statistical stylometry; rather it attempts to put the reader in a position to make such evaluations for himself. There are a number of books canvassing the value of stylometry: a recent and lively one is A. Q. Morton's Literary Detection. Morton is an energetic and imaginative pioneer in the field, and most of his books contain a brief introduction to statistical methods; but in my view his presentation of these methods is too brisk and allusive to provide an initiation for the innumerate beginner.

I am very much indebted to N. D. Thomson who eliminated many errors from an early draft of this book. Responsibility for errors which remain is, of course, my own.

ANTHONY KENNY
Balliol, January 1981

1 The Statistical Study of Literary Style

THE beginning of the statistical study of style in modern times is commonly dated to 1851 when Augustus de Morgan suggested that disputes about the authenticity of some of the writings of St Paul might be settled by the measurement of the length, measured in letters, of the words used in the various Epistles. The first person actually to test the hypothesis that word-length might be a distinguishing characteristic of writers was an American geophysicist called T. C. Mendenhall, who expounded his ideas in a popular journal, Science, in 1887. As a spectroscope analyses a beam of light into a spectrum characteristic of an element, he wrote, so 'it is proposed to analyse a composition by forming what may be called a "word spectrum" which shall be a graphic representation of words according to their length and to their relative frequency of occurrence'. He hoped that each author might be found to be as constant in his habits as each element in its properties.
'If it can be proved that the word spectrum... of David Copperfield is identical with that of Oliver Twist, of Barnaby Rudge, of Great Expectations... and that it differs sensibly from that of Vanity Fair, or Eugene Aram, or Robinson Crusoe ... then the conclusion will be tolerably certain that when it appears it means Dickens.'

Mendenhall took several authors and constructed their word-spectra, or what would nowadays be called frequency distributions of word-length. He studied a number of samples of Oliver Twist and found great similarities between them. Three-letter words were always the commonest; this was true also in 10,000 words studied from Vanity Fair. But when two samples of J. S. Mill's work were investigated, it was found that the two-letter words were the most popular. The average length of word in Mill, however, was longer than that in either Dickens or Thackeray: it was 4.775, compared with Dickens' 4.312 and Thackeray's 4.481. It was longer also than either part of the Authorized Version, where the average was 4.604 for the Old Testament and 4.625 for the New. De Morgan seems to have thought that the average word-length by itself might be an indication of authorship; but Mendenhall's studies showed that texts with the same average word-length might possess different spectra. The comparison between Mill and other authors provided a striking example of this.

It was by comparing the whole spectrum of word-length preferences that Mendenhall hoped to offer a scientific solution to disputes about authorship. His most ambitious attempt in this direction was a study of the Shakespeare-Bacon controversy in the Popular Science Monthly in 1901. With a grant from a Boston philanthropist, Mendenhall employed two secretaries and a counting machine to analyse some 400,000 words of Shakespeare, 200,000 words of Bacon and quantities of text from authors of other periods.
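
For the reader who has a computer to hand, the construction of a word spectrum can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mendenhall's own procedure: a word is taken to be any unbroken run of letters, the function names are hypothetical, and the two sample texts are invented so that they share an average word-length of three letters while having different spectra.

```python
import re
from collections import Counter


def length_counts(text):
    """Count how many words of each length (in letters) occur in text."""
    # Assumption: a word is any unbroken run of letters, so apostrophes
    # and hyphens split words. Other conventions are equally defensible.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(word) for word in words)


def spectrum(counts):
    """Relative frequency of each word-length, ordered by length."""
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


def mean_length(counts):
    """Average word-length in letters."""
    total = sum(counts.values())
    return sum(length * n for length, n in counts.items()) / total


def commonest_length(counts):
    """The word-length that occurs most often (the mode)."""
    return counts.most_common(1)[0][0]


# Two invented samples with the same mean but different spectra.
sample_a = length_counts("the cat sat")   # lengths 3, 3, 3
sample_b = length_counts("to be brave")   # lengths 2, 2, 5

print(mean_length(sample_a), mean_length(sample_b))
print(spectrum(sample_a))
print(spectrum(sample_b))
print(commonest_length(sample_a), commonest_length(sample_b))
```

The two samples have the same mean, yet the commonest length is three letters in the first and two in the second. This is the point made above about the average taken by itself: it cannot tell the two samples apart, whereas the whole distribution can.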
He found that Shakespeare's 'characteristic curve' was very consistent from sample to sample, whether poetry or prose; it had unusual features. 'Shakespeare's vocabulary consisted of words whose average length was a trifle below four letters, less than that of any writer of English before studied; and his word of greatest frequency was the four-letter word, a thing never met with before.' These characteristics marked Shakespeare out from most of his contemporaries just as sharply as from the 19th-century authors previously studied. In particular his characteristic curve differed from that of Bacon. But, to his consternation, Mendenhall discovered that 'in the characteristic curve of his plays Christopher Marlowe agrees with Shakespeare about as well as Shakespeare agrees with himself'.

While Mendenhall had been using stylistic statistics in the attempt to solve attribution problems in English, European scholars had been developing stylometric techniques for Greek in order to settle the disputed chronology of Plato's dialogues. Lewis Campbell, a Balliol classicist who became Professor of Greek at St Andrews, offered in his 1867 edition of the Sophist and Politicus a battery of stylistic tests which he believed to be indicators of relative date: word order, rhythm, the avoidance of hiatus, and especially 'originality of vocabulary' as measured by the frequency of hapax legomena or once-occurring words, which he noted and counted carefully. He concluded that the Sophist and Politicus belonged to a late period in Plato's life, after the Republic and close to the Laws. His work passed unnoticed for nearly 30 years, but the German philologist Ritter reached similar conclusions by similar methods in 1888.

The work of these and other Platonic scholars was magisterially synthesized by the Polish scholar W. Lutoslawski in his work of 1897, The Origin and Growth of Plato's Logic.
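
The counting of hapax legomena mentioned above is tedious by hand but simple to sketch in Python. What follows is an illustration only. The function names are hypothetical, the sample sentence is invented, and the text does not record how Campbell expressed his counts as a frequency; dividing by the total number of words in the text is merely one plausible assumption. The pattern used here recognises only unaccented Latin letters, so a Greek text would need a different rule for picking out words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case words (runs of the letters a to z)."""
    return re.findall(r"[a-z]+", text.lower())


def hapax_legomena(text):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(tokenize(text))
    return sorted(word for word, n in counts.items() if n == 1)


def hapax_rate(text):
    """Once-occurring words as a fraction of all words in the text.

    Assumption: the total word count is the denominator; the source
    does not say how Campbell normalised his counts.
    """
    words = tokenize(text)
    return len(hapax_legomena(text)) / len(words)


# An invented sample, not drawn from any of the dialogues.
sample = "the sophist and the statesman and the philosopher"

print(hapax_legomena(sample))
print(hapax_rate(sample))
```

In this sample 'the' and 'and' recur, so only the three nouns count as once-occurring words, out of eight words in all.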
Lutoslawski lists 500 different facets of style which he believed could be used as stylometric indicators, and he was insistent on the importance of complete enumeration and numerical precision. For instance among the features he listed are the following:

318. Answers denoting subjective assent less than once in 60 answers.
325. Superlatives in affirmative answers more than half as frequent as positives, but not prevailing over positives.
378. Interrogations by means of ara between 15 and 24% of all interrogations.
412. Peri placed after the word to which it belongs forming more than 20% of all occurrences of peri.

He went so far as to formulate an explicit hypothesis which governed his stylometric investigations. He called it the 'Law of stylistic affinity', and he enunciated it thus:

Of two works of the same author and of the same size, that is nearer in time to a third, which shares with it the greater number of stylistic peculiarities, provided that their different importance is taken into account, and that the number of observed peculiarities is sufficient to determine the stylistic character of all the three works.

The total number of peculiarities to be considered, he reckoned, should be not less than 150 in a dialogue of ordinary length.

The stylistic features studied by these Platonic scholars were much more subtle and sophisticated than the crude measure of word-length investigated by Mendenhall. On the other hand, Mendenhall had a surer grasp of statistical principles than Lutoslawski, despite the impressiveness of some of the latter's numerical tables. As for their

