
A Statistical Linguistic Analysis of American English

438 Pages·1965·19.386 MB·English

A STATISTICAL LINGUISTIC ANALYSIS OF AMERICAN ENGLISH

JANUA LINGUARUM
STUDIA MEMORIAE NICOLAI VAN WIJK DEDICATA
edenda curat
CORNELIS H. VAN SCHOONEVELD
STANFORD UNIVERSITY
SERIES PRACTICA VIII
1965
MOUTON & CO.
LONDON • THE HAGUE • PARIS

A STATISTICAL LINGUISTIC ANALYSIS OF AMERICAN ENGLISH
by
A. HOOD ROBERTS
1965
MOUTON & CO.
THE HAGUE • LONDON • PARIS

© Copyright 1965 Mouton & Co., Publishers, The Hague, The Netherlands. No part of this book may be translated or reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publishers.

Printed in The Netherlands

PREFACE

Frequency studies of the components of language have always been hampered by the prodigious amount of man-hours required to make the studies extensive enough to be either valid or useful. The great semantic counts and the Lorge "Magazine Count" of the 1930's were by-products of the depression. For example, The Semantic Count of the 570 Commonest English Words by Irving Lorge employed the efforts of several hundred workers provided by the W.P.A. for a period of six years beginning in 1934. Fortunately, scores of phoneme counters were not needed to do the work on this project owing to these factors:

1. The data from the frequency counts of the past can be converted into a form more useful in present-day linguistic study.
2. Modern digital computers can now do in minutes work which would require thousands of man-hours.

The original idea for this study came from Professor Frederic G. Cassidy, who suggested it to me as a possible dissertation topic and who was instrumental in obtaining the funds necessary to accomplish this investigation. I owe a great debt of gratitude to Professor Cassidy for the helpfulness and patient understanding which he showed me in his direction of this dissertation.
I would like to acknowledge the grant made available for this project by the Research Committee of the Graduate School of the University of Wisconsin which enabled me to use the facilities of the Numerical Analysis Laboratory. Despite the speed of the computer used in this project, the preparatory work, which had to be done by hand, was both laborious and time-consuming and required something over 2,000 man-hours. It is with pleasure that I acknowledge my indebtedness to those helpful participants in what was, for the most part, sheer drudgery.

My appreciation is expressed here to those who assisted in various stages of the project:

Professor Gerald B. Kelley, for his assistance in determining the phonemics of the informant.

My colleague and the informant for this study, Donald C. Green, not only for his willingness to record the tapes but also for his assistance in the laborious task of checking the first print-out for errors.

Mrs. Jean Walsh and my brother Hal Roberts for their hours spent in the tedious recording of data on the laboratory sheets.

John Cerveny for his aid in determining the number of letters in the corpus and for his help in numerous other parts of the study.

Miss Margaret Horigan for her assistance in several areas of the investigation.

Miss Nancy Krahn, who did part of the programming, for her perfect cooperation and willingness to give of her time.

My colleague, Richard George Wolfe, who did the major part of the programming for this project, and whose capability was matched by his eagerness to serve. Without his expenditure of time and effort, this project would still be far from completion.

Professor Murray Fowler, for his willingness to serve as acting chairman of the examining committee and his aid in seeing this project through to its completion.

To my wife, Carolyn Roberts, go my deepest thanks, for her understanding, assistance and encouragement during the work on this project.
Without her I could not possibly have done this work and to her is owed my deepest gratitude.

Western Reserve University
A. H. R.

CONTENTS

Preface

I. PIONEERS IN WORD COUNTING

II. GENERAL PLAN OF THE PRESENT STUDY

III. PREPARATORY WORK
    Choice of System of Phonemic Notation
    Selection of Informant
    Informant's Phonemics
    Recording the Corpus
    Phonemic Transcription of the Corpus
    Recomputation of Phoneme Frequencies
    Choice of Etymological Authority
    Manner of Recording Etymological Sources
    Recomputation of Etymological Frequencies
    Choice of Word Count for Use in this Study
    Review of the Criteria for Selection of Word Count

IV. PROCESSING THE DATA

V. RESULTS OF THE INVESTIGATION
    Relationship between Alphabetic and Phonemic Codes
    The Etymological Composition of English
    Phoneme Frequencies
    Statistical Analysis of Phoneme Frequencies
    Vowel/Consonant Ratio
    Word Length in Phonemes and in Syllables
    Canonical Forms
    Canonical Forms with Respect to Manner of Articulation
    Canonical Forms with Respect to Points of Articulation
    Transitional Probabilities for Sequences of Two Phonemes
    Transitional Probabilities for Word-Initial Phoneme Sequences
    Entropy and Redundancy
    Initial Consonants and Consonant Clusters
    Intervocalic Consonants and Consonant Clusters
    Final Consonants and Consonant Clusters

VI. PREVIOUS STUDIES OF SPEECH SOUNDS

BIBLIOGRAPHY

APPENDICES
    I. Etymological Composition of English
    II. Relative Frequency of Segmental Phonemes
    III. Relative Frequency of Vowels
    IV. Relative Frequency of Consonants and Semivowels
    V. Phoneme Proportions by Number and Frequency for Each Decile
    VI. Word Length in Phonemes
    VII. Word Length in Syllables
    VIII. Joint Frequency Distribution of Word Length by Syllable and Phoneme Number
    IX. Canonical Forms of Consonant, Vowel, Semivowel
    X. Canonical Forms by Manner of Articulation of Phonemes
    XI. Canonical Forms by Place of Production of Phonemes
    XII. Transitional Probabilities for Sequences of Two Phonemes
    XIII. Transitional Probabilities for Word-Initial Phoneme Sequences
    XIV. Entropy and Redundancy based on Phoneme Frequencies by Decile
    XV. Entropy and Redundancy based on Word Length in Phonemes by Decile
    XVI. Entropy and Redundancy based on Word Length in Syllables by Decile
    XVII. Initial Consonants and Consonant Clusters
    XVIII. Intervocalic Consonants and Consonant Clusters including Pre-Consonantal Off-Glides
    XIX. Final Consonants and Consonant Clusters including Pre-Consonantal Off-Glides
    XX. Intervocalic Consonants and Consonant Clusters
    XXI. Final Consonants and Consonant Clusters
    XXII. Phonemic Transcription of the First Decile

I

PIONEERS IN WORD COUNTING

Frequency studies of the components of languages have been concentrated largely on the two components, words and sounds. Some investigations have been based on counts of entries in dictionaries. This type of count reveals the overall pattern of the lexicon, but its great limitation is that it does not take into account the importance of the components as determined by their frequency of use. Other counts have been based on running words - either printed, written, or spoken. These frequency counts have been made with a variety of purposes in mind. Some have been made to determine the most frequent sounds in the language; others have determined the most frequently used words in various types of reading or in spelling. Despite the variety of these latter studies, underlying them all is the principle that, to a great extent, a word's importance is measured by its frequency of occurrence.

Perhaps the best known investigators of word frequency in the United States are R. C. Eldridge, Edward L. Thorndike, Irving Lorge, Ernest Horn, and G. K. Zipf.
Although each of these pioneers in this field used word frequency as the basis for their studies, their purposes and interests were different.

R. C. Eldridge, the manager of a factory which employed a high proportion of foreign born, was concerned with the employees' problems in learning English. His count grew out of this concern and his interest in a universal phonetic alphabet and a universal vocabulary. The sample for his count was taken from four different newspapers published in Buffalo, New York, on different dates in 1909. A word list was made from the count of the words in each newspaper, and a fifth list containing 6,000 different words out of a total of 43,989 running words was made by combining the four counts.

Edward L. Thorndike, a professor of Educational Psychology at Teachers College, Columbia University, was interested chiefly in the study of reading vocabulary. His work in this field was continued for many years, and in this time he compiled three influential word lists, at intervals of a decade. His first study, The Teacher's Word Book, "...is an alphabetical list of the 10,000 words which are found to occur most widely in a count of about 625,000 words from literature for children; about 3,000,000 words from the Bible and English classics; about 300,000 words from elementary-school text books; about 50,000 words from books about cooking, sewing, farming, the trades, and the like; about 90,000 words from the daily newspapers; and about 500,000 words from correspondence. Forty-one sources were used."1 With the publication of this count, Thorndike became the acknowledged leader in this area of study in the United States. However, he was not content to stop with the publication of this work, and in 1932 he published an expanded word count.2 This list added the counts from over two hundred sources including about 5,000,000 words, and incorporated the results of other published counts.
Then in 1944, with Irving Lorge, Thorndike published The Teacher's Word Book of 30,000 Words.3 With the publication of this compilation, word counts of reading vocabulary had gone about as far as they were ever apt to go. Based on counts of over 20,000,000 running words, the 1944 count is regarded by many as the epitome of word counts. Perhaps the 1944 count helped foster an attitude of reverence for the work. The 1921 and 1932 counts both emphasized that they were counts of English reading, as, indeed, the 1944 count does in the "Preface" which states "This book is not final as a frequency count of English reading." However, the "Introduction" to the 1944 count begins, "Part 1 of this book is a list of words, each followed by a record of the frequency of occurrence of the word in general, and four different sets of reading matter."4 The word general perhaps has been misinterpreted by some. One is reminded of what Irving Lorge, the co-author of the 1944 count, has said: "Practically all counts that have been made show that there is no finality in word counts. The extent of the sampling, the choice of the materials counted (printed books or magazines, spoken vocabulary, written correspondence, compositions, or school work), the nature of the selection of materials (geographic, urban-rural) all play a part in the specification of the universe of background materials in communication."5

Irving Lorge was a professor of Education at Teachers College, Columbia University, and a frequent collaborator with Thorndike. Lorge and Thorndike were co-authors of A Semantic Count of English Words.6 This count, based upon approximately five million words, gives the total frequency of occurrence for each word (except the five hundred most frequent) and the relative occurrence of all its different meanings.
The omission of the five hundred most frequent words in A Semantic Count is a serious disadvantage inasmuch as the most frequently used words are generally those which have the largest number of different meanings. In 1949 this

1 Edward L. Thorndike, The Teacher's Word Book (New York, Teachers College, Columbia University, 1921), p. iii.
2 Edward L. Thorndike, A Teacher's Word Book of the 20,000 Words Found Most Frequently and Widely in General Reading for Children and Young People (New York, Teachers College, Columbia University, 1932).
3 Edward L. Thorndike and Irving Lorge, The Teacher's Word Book of 30,000 Words (New York, Teachers College, Columbia University, 1944).
4 Ibid., p. ix.
5 Irving Lorge, "Word Lists as Background for Communication", Teachers College Record, VI (1944), p. 546.
6 Irving Lorge and Edward L. Thorndike, A Semantic Count of English Words (New York, Teachers College, Columbia University, 1938).
