
Statistical methods in language and linguistic research

137 Pages·2013·18.995 MB·English


Statistical Methods in Language and Linguistic Research

To the memory of my father

Published by Equinox Publishing Ltd.
UK: Unit S3, Kelham House, 3 Lancaster Street, Sheffield S3 8AF
USA: ISD, 70 Enterprise Drive, Bristol, CT 06010
www.equinoxpub.com

First published in 2013
© Pascual Cantos Gómez, 2013

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage or retrieval system, without prior permission in writing from the publishers.

ISBN: 978-1-84553-431-8 (hardback)
978-1-84553-432-5 (paperback)

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Cantos Gómez, Pascual.
Statistical methods in language and linguistic research / Pascual Cantos Gómez.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84553-431-8 (hb) -- ISBN 978-1-84553-432-5 (pb)
1. Language and languages--Statistical methods. 2. Language and languages--Study and teaching--Statistical methods. I. Title.
P123.C28 2009
407'.27--dc22
2009005206

Typeset by JS Typesetting Ltd, Porthcawl, Mid Glamorgan
Printed and bound in Great Britain by Lightning Source UK, Milton Keynes

Contents

Acknowledgements
Preface

1. Some basic issues
   1.1 Introduction
   1.2 Measures of central tendency
   1.3 Proportions and percentages
   1.4 Dispersion
   1.5 Standard scores
   1.6 Distributions
   1.7 Probability
2. Scales and variables
   2.1 Types of scales
   2.2 Types of variables
3. Parametric versus non-parametric statistics
   3.1 Introduction
   3.2 Parametric tests
   3.3 Non-parametric tests
4. Reducing dimensionality: multivariate statistics
   4.1 Introduction
   4.2 Cluster analysis
   4.3 Discriminant function analysis
   4.4 Factor analysis
   4.5 Multiple regression
5. Word frequency lists
   5.1 Typology and usefulness
   5.2 Keywords
   5.3 Text annotation
   5.4 Comparing wordlists
   5.5 Frequency distributions
6. Words in context
   6.1 Concordances
   6.2 Collocations
   6.3 Measuring collocation strength
   6.4 Qualitative and quantitative data analysis: an example

Appendices
   1. Standard normal distribution
   2. Examples of appropriate statistics
   3. t-distribution
   4. F-distribution
   5. Pearson product-moment correlation coefficient
   6. U-distribution for the Mann-Whitney test
   7. Sign test
   8. Chi-square distribution
   9. Spearman rank correlation coefficient

Bibliography
Index

Acknowledgements

When I first started writing some of the parts now present in this book, I never thought and planned writing a book like this. They were mere drafts, a simple informal collection of lecture notes for my PhD module on Quantitative Data Analysis at the University of Murcia. I am particularly grateful to those friends, colleagues and students who encouraged and persuaded me to get into this project.

Thanks are also due to Professor Aquilino Sánchez who was responsible for first encouraging my interest in statistics. I am particularly indebted to Liz Murphy, who has helped me refine this book; each page of this work has benefited from her careful revision and analysis. Thanks, Liz, for having taken much of your time to read and comment on drafts.

I would also like to thank the people from Equinox Publishing (Janet Joyce, Valerie Hall and Kate Williams, among others) for believing in this project and for their assistance and support throughout the production of this book. Thanks also to copy-editor Judy Napper, Hamish Ironside, Gina Manee and Christine James.
Thanks go also to my wife Andrea and daughters, Andrea and Belén, for the many sacrifices they made to allow me to work on this book; I would still be working on the index if it had not been for Belén.

This book is dedicated to the memory of my father. He has always shown a genuine interest in and concern for my life, my work and my well-being.

Preface

Statistics is known to be a quantitative approach to research. However, most of the research done in the fields of language and linguistics is of a different kind, namely qualitative. Succinctly, qualitative analysis differs from quantitative analysis in that in the former no attempt is made to assign frequencies, percentages and the like to the linguistic features found or identified in the data. In quantitative research, linguistic features are classified and counted, and even more complex statistical models are constructed in order to explain these observed facts. In qualitative research, however, we use the data only for identifying and describing features of language usage and for providing real occurrences/examples of particular phenomena.

This book is an attempt to show how quantitative methods and statistical techniques can supplement qualitative analyses of language. We shall attempt to present some mathematical and statistical properties of natural languages, and introduce some of the quantitative methods which are of the most value in working empirically with texts and corpora, illustrating the various issues with examples and moving from the most basic descriptive techniques to decision-taking techniques and to more sophisticated multivariate statistical language models.

Among the linguistic community, statistical methods or more generally quantitative techniques are mostly ignored or avoided because of lack of training, and also fear and dislike. The reasons: (1) these techniques are just not related to linguistics, philology or humanities; statistics falls into the province of sciences, mathematics and the like; and/or (2) there is a feeling that these methods may destroy the "magic" in literary text.

There currently exist quite a few introductory statistics texts for linguists. Some of these books are either (1) statistically and mathematically too demanding, not really intended for linguists, (2) restricted to applied linguistics and with little statistical coverage, or (3) not directly related to linguistics but more social science oriented.

The aim of this book is to try to illustrate with numerous examples how quantitative methods can most fruitfully contribute to linguistic analysis and research. In addition, we do not intend here to offer an exhaustive presentation of all statistical techniques available to linguistics, but to demonstrate the contribution that statistics can and should make to linguistic studies.

We have tried to keep the writing simple and have sought to avoid too much statistical jargon whenever possible in an attempt to be user-friendly. This explains also why we have deliberately avoided using references within the text; these are normally given in footnotes, although an extensive reference section is given at the end of the book for the interested reader.

This book presents an accessible introduction to using statistics in language and linguistic research. It describes the most popular statistical techniques, explaining the basic principles and demonstrating their use in a wide range of linguistic research: (1) observations and descriptions of some aspects of language phenomena including the areas of applied linguistics, sociolinguistics, dialectology and so on, as far as they can be modelled by means of quantitative mathematical methods, (2) applications of methods, models or techniques from quantitative linguistics to problems of natural language description and language teaching, and (3) methodological problems of linguistic measurement, model construction, sampling and test theory.

The range of techniques introduced by the book will help the reader both to evaluate and make use of literature which employs statistical analysis and apply statistics in their own research. Each chapter gives step-by-step explanations of particular techniques using examples from a number of linguistic fields. None of the techniques requires the reader to have a grasp of mathematics more complex than simple algebra.

We have deliberately sought to write and design this book not for mathematicians but for linguists, in an attempt to demonstrate the use of statistical techniques in a wide range of linguistic research.

1 Some basic issues

1.1 Introduction

There are various means that can help us to represent summarized data: frequency and contingency tables, charts or graphs (histograms or frequency polygons). However, we may sometimes wish to compare data or sets of data, in order to look for similarities or differences. Tables, graphs and charts may not always be useful in this respect as it is not easy, on some occasions, to talk succinctly about a set of numbers. Consider the two histograms in Figures 1.1 and 1.2, which show the sentence length in words of two linguistic samples; one contains written texts and the other is made up from written-to-be-spoken text samples, such as speeches, conferences, and so on.
In this case, a mere glance at them allows us to realize immediately that the sentence length in written texts tends to be longer than that in written-to-be-spoken texts. However, the normal shape of things is that situations are rarely that neat and clear. For example, Table 1.1 compares the data on sentence length (in words) of written texts and written-to-be-spoken texts. The figures relative to each variety are very similar but there are some slight differences. With this simple data display, it is very hard, if not impossible, to make any precise statement about these dissimilarities in the overall sentence length of written texts and written-to-be-spoken texts.

Studies in our field often omit the use of graphs. However, graphs usually just provide some form of description (descriptive statistics), that is, a numerical representation of the data that might help us in creating a mental picture of how a linguistic phenomenon performs. It is our responsibility not just to get a "picture" of the situation but also to look at the information or data provided. Both aspects are important.

[Figure 1.1 Sentence length of written texts. Histogram; x-axis: length (in words), in bands from 1-5 to 31-35.]

[Figure 1.2 Sentence length of written-to-be-spoken texts. Histogram; x-axis: length (in words), in bands from 1-5 to 31-35.]

Table 1.1 Written versus written-to-be-spoken texts.

                              1-5   6-10   11-15   16-20   21-25   26-30   31-35
Written texts                   3      6      17      22      17       4
Written-to-be-spoken texts     20     30      12       4       2

One way to look at the information is to represent the data for a group of items by means of scores. That is, single measures that are the most typical scores for a data set.

1.2 Measures of central tendency

The central tendency indicates the typical behaviour of a data set. There are different ways or measures to estimate it. The measures of central tendency can provide us with information on how a group or collection of data performed overall. Therefore, central tendency is a representative or typical score. For example, if a foreign language teacher is asked to provide a single value which best describes the performance level of her group of students, she would answer with a measure of central tendency. The three measures of central tendency that will be discussed are the mode, the median and the mean.

1.2.1 Mode

The mode is the value or score that occurs most frequently in a given set of scores. If we take the above data on sentence length of written and written-to-be-spoken texts (Table 1.1) then the mode for written texts would be sixteen to twenty words, because this sentence length occurs most often (twenty-two times). It would be six to ten words for written-to-be-spoken texts. To keep the mode clear in your mind, associate the term with its near-synonym: fashionable. Thus, the mode is the score or value that is most fashionable. This score has no statistical formula and is straightforward. The mode is easily identifiable in frequency polygons, that is, graphic displays of frequency tables, as it corresponds to the score with the highest point. A distribution may have more than one mode if two or more scores occur the same number of times. Such distributions are called bimodal (two modes), trimodal (three modes), and so on.

This central tendency score is easy to use and quick to compute. However, it is not very useful as it does not give much information about the distribution and it is easily affected by chance scores, although for large data sets this is less likely to happen.

1.2.2 Median

In any group of scores the median is the middle point or central score of the distribution, with half of the scores lying above and half falling below. It is that point below which 50 per cent of the scores fall and above which 50 per cent fall, dividing any set of scores into two equal subgroups; one of these contains all the scores lower than the median and the other all those greater than the median. For example, in a corpus consisting of sentences with the following number of words: 15, 16, 20, 22, 12, 12, 13, 15 and 14; the median is 15 (12, 12, 13, 14, 15, 15, 16, 20 and 22), as four of the sentences are shorter (12, 12, 13 and 14) and four are longer or equal (15, 16, 20 and 22). If there is an even number of scores as, for example, in this distribution: 12, 12, 13, 14, 15, 16, 16, 17, 20, 22, then the median is the midpoint between the two middle scores: 15.5, which is calculated by adding the two middle scores and dividing by 2:

$\frac{15 + 16}{2} = 15.5$

1.2.3 Mean

The mean, also known as the average, is the sum of all scores divided by the total number of scores. Application of it to our sentence length example yields the result in Table 1.2.

Table 1.2 Calculation of the mean.

Sentence   Sentence length (in words)
1          12
2          12
3          13
4          14
5          15
6          16
7          16
8          17
9          20
10         22

Calculations:
1. Sum of scores: $\sum X = 12 + 12 + 13 + 14 + 15 + 16 + 16 + 17 + 20 + 22 = 157$
2. Number of scores: $N = 10$
3. Mean: $\bar{X} = \frac{157}{10} = 15.7$

That is, the mean sentence length for the data set in Table 1.2 is 15.7 words per sentence. The mean is defined by the formula:

$\bar{X} = \frac{\sum X}{N}$

where $\bar{X}$ = mean, $N$ = number of scores and $\sum X$ = sum of all scores.

The mean is the preferred measure of central tendency, both as a description of the data and as an estimate of the parameter. Its major drawback is that it is sensitive to extreme scores. If the data are not normally distributed (see section 1.6.1 below), for example if most of the items are grouped towards the higher end of the scale, then the median may be a more reliable measure.

1.3 Proportions and percentages

On many occasions data are presented as proportions or percentages. Suppose we are interested in the distribution of a linguistic variable in relation to some other variable: for example, the use of modal verbs (linguistic variable) in relation to arts texts versus science texts. To simplify matters we shall concentrate on the distribution of just ten modal verbs (can, could, may, might, must, ought to, shall, should, will and would). Table 1.3 shows the number of times each modal verb occurs in each linguistic domain (arts and science).

Table 1.3 Modal verbs in arts and science texts.

Modal verb   Arts   Science
Can           265       778
Could         296       307
May           187       547
Might         157       113
Must          130       236
Ought to       12         6
Shall          18        44
Should         98       159
Will          174       593
Would         421       485

However, we could also represent the data as fractions of the total number of modal verbs, in percentages (see Table 1.4). This table displays the partial percentages and total percentages. If we take the second arts column, we see that can makes up 15.07 per cent of all modal verbs used in arts texts. Similarly, in the second totals column we find that, for example, should makes up 5.11 per cent of all modal verbs used in arts and science. Comparing arts and science, scientific communication uses many more modals than arts does: 65.03 per cent versus 34.97 per cent.

Table 1.4 Modal verbs in arts and science texts in percentages.

Modal verb   Arts             Science          Total
Can           265   15.07%     778   23.80%    1043    20.75%
Could         296   16.83%     307    9.39%     603    11.99%
May           187   10.63%     547   16.73%     734    14.60%
Might         157    8.93%     113    3.45%     270     5.37%
Must          130    7.39%     236    7.22%     366     7.28%
Ought to       12    0.68%       6    0.18%      18     0.35%
Shall          18    1.02%      44    1.34%      62     1.23%
Should         98    5.57%     159    4.86%     257     5.11%
Will          174    9.89%     593   18.14%     767    15.26%
Would         421   23.94%     485   14.84%     906    18.02%
Total        1758   34.97%    3268   65.03%    5026   100.00%

Percentages and proportions are particularly useful to summarize data. However, we should also note that simple averages of percentages and proportions are not always appropriate, particularly whenever the data are presented only as percentages and proportions without the original values being given. This is a serious error that often prevents analysis of the data by another researcher. In addition, it may be difficult to interpret a mean proportion or percentage when it is not clear what the total number of observations was over which the original proportion was measured, especially when the original values which the percentages were based on are omitted.

1.4 Dispersion

The mean and the median can be very useful for the comparison of data sets as they help us to interpret the typical behaviour of the data in the form of central tendency. If these data sets are very different, then we are likely to have made a significant discovery. However, if the two means or medians are very similar, then it is difficult to make any statement on the complete set of scores. Let us consider the artificial example in Table 1.5, showing the distributions of nouns and verbs in arts and science texts over five samples of equal length:

Table 1.5 Nouns and verbs in arts and science texts.

         Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
Nouns          20         29         24         15         22
Verbs          25         23         18         25         19

Comparing the mean of nouns and verbs, we get 22 for both cases, and two very similar medians: 22 (nouns) and 23 (verbs). Both the mean and the median reveal that both linguistic items are very similar in shape or structure. A closer look at the data, however, reveals that both samples have a dissimilar variability, that is, a different spread or dispersion of scores. Thus, for nouns, the highest score is 29 and the lowest 15, and the data dispersion range is 15 (29 - 15 + 1), whereas for verbs, we get a dispersion of 8 (25 - 18 + 1). These two measures, 15 and 8, provide the range of nouns and verbs respectively, giving us an idea of how the various samples vary from the central tendency.

1.4.1 Range

The range is defined as the number of points between the highest score and the lowest one plus one (plus one in order to include the scores of both ends). The range is a quick measure of variability, as it provides information on how data scores vary from the central tendency.

Occasionally, it may give a distorted picture of the data as it just represents the extreme scores of the variation and, as a result, it is strongly affected by behaviour that may not necessarily be representative of the data set as a whole. For instance in Table 1.5, if there were a text sample with just 5 nouns, then the noun range would change dramatically from 15 to 25 (29 - 5 + 1). This range would not really represent the behaviour of noun occurrence, as it was strongly affected by a single score (5). The range should be taken cautiously just as a dispersion indicator and should be interpreted simply as the number of points between the highest and the lowest scores, including both.

1.4.2 Variance and standard deviation

Recall our sentence length data above (section 1.2.3). As we know its mean length (15.7), we can calculate the difference between each sentence length value and the mean length:

$\text{difference}_i = X_i - \bar{X}$

Intuitively, this difference measure might give us some kind of information on how far apart the individual sentence length values cluster around the mean length. Table 1.6 shows all the calculations. A close look at the figures reveals that not all differences are positive; some are negative, indicating that the sentence lengths of these samples are less than the mean. Furthermore, if we add all the difference values, we come up with zero. This happens not just here but will occur with any data set, if the calculations are performed correctly.

Table 1.6 Differences with the mean.

Sentence length (in words)   Difference (diff) with mean
12                           -3.7
12                           -3.7
13                           -2.7
14                           -1.7
15                           -0.7
16                            0.3
16                            0.3
17                            1.3
20                            4.3
22                            6.3
N = 10                       $\sum \text{diff} = 0$

One way to avoid always getting the same result (zero) is to square the differences, as we shall then get only positive figures (see Table 1.7).

Once we have the sum of the squared differences ($\sum \text{diff}^2 = 98.1$), we can calculate the variance straightforwardly:

$V = \frac{\sum \text{diff}^2}{N}$

$V = \frac{98.1}{10} = 9.81$

The variance is thus the arithmetic mean of the squared differences, or squared deviations. However, dividing by N (number of cases) is only correct whenever N is very large. In our case, with only ten scores, it would be wise to divide the squared deviations by N - 1, instead.¹ This would result in:

$V = \frac{\sum \text{diff}^2}{N - 1}$

$V = \frac{98.1}{9} = 10.9$

¹ Generally, the denominator of N is used for a population and N - 1 for a sample.
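
The arithmetic in sections 1.2.3 and 1.4.2 can be checked with a few lines of code. The following is a minimal sketch in Python, which is not part of the original text; the variable names are illustrative assumptions. It recomputes the mean, the differences from the mean, and the variance with both denominators (N and N - 1) for the ten sentence lengths, and compares the hand calculation with the functions in Python's standard `statistics` module.

```python
import math
from statistics import mean, pvariance, variance

# Sentence lengths (in words) from the worked example in the text.
lengths = [12, 12, 13, 14, 15, 16, 16, 17, 20, 22]

n = len(lengths)                 # N = 10
mean_length = sum(lengths) / n   # 157 / 10

# Differences between each score and the mean; their sum is zero
# (up to floating-point rounding) for any data set.
diffs = [x - mean_length for x in lengths]
sum_diffs = sum(diffs)

# Squaring removes the signs, so the terms no longer cancel out.
sum_sq_diffs = sum(d ** 2 for d in diffs)

variance_n = sum_sq_diffs / n                # denominator N (population)
variance_n_minus_1 = sum_sq_diffs / (n - 1)  # denominator N - 1 (sample)

# Cross-check the hand calculation against the standard library.
assert math.isclose(mean_length, mean(lengths))
assert math.isclose(variance_n, pvariance(lengths))
assert math.isclose(variance_n_minus_1, variance(lengths))

print(f"mean = {mean_length:.1f}")
print(f"differences sum to zero: {abs(sum_diffs) < 1e-9}")
print(f"sum of squared differences = {sum_sq_diffs:.1f}")
print(f"variance (N) = {variance_n:.2f}")
print(f"variance (N - 1) = {variance_n_minus_1:.2f}")
```

The sum of the differences is compared with a small tolerance rather than with exactly zero, because floating-point arithmetic can leave a tiny residue even though the true sum is zero.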
