Statistical Methods in Language and Linguistic Research
To the memory of my father
Published by Equinox Publishing Ltd.
UK: Unit S3, Kelham House, 3 Lancaster Street, Sheffield S3 8AF
USA: ISD, 70 Enterprise Drive, Bristol, CT 06010
www.equinoxpub.com
First published in 2013
© Pascual Cantos Gómez, 2013
All rights reserved. No part of this publication may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including photocopying, recording or
any information storage or retrieval system, without prior permission in writing from the
publishers.
ISBN: 978-1-84553-431-8 (hardback)
978-1-84553-432-5 (paperback)
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Cantos Gómez, Pascual.
Statistical methods in language and linguistic research / Pascual Cantos Gómez.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84553-431-8 (hb)-ISBN 978-1-84553-432-5 (pb) 1. Language
and languages-Statistical methods. 2. Language and languages-Study and
teaching-Statistical methods. I. Title.
P123.C28 2009
407'.27-dc22
2009005206
Typeset by JS Typesetting Ltd, Porthcawl, Mid Glamorgan
Printed and bound in Great Britain by Lightning Source UK, Milton Keynes
Contents
Acknowledgements ix
Preface xi
1. Some basic issues 1
1.1 Introduction 1
1.2 Measures of central tendency 2
1.3 Proportions and percentages 4
1.4 Dispersion 5
1.5 Standard scores 10
1.6 Distributions 13
1.7 Probability 25
2. Scales and variables 35
2.1 Types of scales 35
2.2 Types of variables 38
3. Parametric versus non-parametric statistics 43
3.1 Introduction 43
3.2 Parametric tests 46
3.3 Non-parametric tests 69
4. Reducing dimensionality: multivariate statistics 89
4.1 Introduction 89
4.2 Cluster analysis 90
4.3 Discriminant function analysis 104
4.4 Factor analysis 113
4.5 Multiple regression 122
5. Word frequency lists 135
5.1 Typology and usefulness 135
5.2 Keywords 153
5.3 Text annotation 156
5.4 Comparing wordlists 167
5.5 Frequency distributions 175
6. Words in context 185
6.1 Concordances 185
6.2 Collocations 196
6.3 Measuring collocation strength 196
6.4 Qualitative and quantitative data analysis: an example 228
Appendices
1. Standard normal distribution 235
2. Examples of appropriate statistics 236
3. t-distribution 237
4. F-distribution 238
5. Pearson product-moment correlation coefficient 239
6. U-distribution for the Mann-Whitney test 241
7. Sign test 242
8. Chi-square distribution 243
9. Spearman rank correlation coefficient 244
Bibliography 245
Index 253
Acknowledgements
When I first started writing some of the parts now present in this book, I never
thought of or planned writing a book like this. They were mere drafts, a simple informal
collection of lecture notes for my PhD module on Quantitative Data Analysis at
the University of Murcia. I am particularly grateful to those friends, colleagues and
students who encouraged and persuaded me to get into this project.
Thanks are also due to Professor Aquilino Sánchez who was responsible for first
encouraging my interest in statistics. I am particularly indebted to Liz Murphy, who
has helped me refine this book; each page of this work has benefited from her careful
revision and analysis. Thanks, Liz, for having taken much of your time to read and
comment on drafts.
I would also like to thank the people from Equinox Publishing (Janet Joyce,
Valerie Hall and Kate Williams, among others) for believing in this project and for
their assistance and support throughout the production of this book. Thanks also to
copy-editor Judy Napper, Hamish Ironside, Gina Manee and Christine James.
Thanks go also to my wife Andrea and daughters, Andrea and Belén, for the many
sacrifices they made to allow me to work on this book; I would still be working on
the index if it had not been for Belén.
This book is dedicated to the memory of my father. He has always shown a
genuine interest in and concern for my life, my work and my well-being.
Preface
Statistics is known to be a quantitative approach to research. However, most of the
research done in the fields of language and linguistics is of a different kind, namely
qualitative. Succinctly, qualitative analysis differs from quantitative analysis in that
in the former no attempt is made to assign frequencies, percentages and the like to
the linguistic features found or identified in the data. In quantitative research, linguistic
features are classified and counted, and even more complex statistical models
are constructed in order to explain these observed facts. In qualitative research,
however, we use the data only for identifying and describing features of language
usage and for providing real occurrences/examples of particular phenomena.
This book is an attempt to show how quantitative methods and statistical techniques
can supplement qualitative analyses of language. We shall attempt to present
some mathematical and statistical properties of natural languages, and introduce
some of the quantitative methods which are of the most value in working empirically
with texts and corpora, illustrating the various issues with examples and moving
from the most basic descriptive techniques to decision-taking techniques and to
more sophisticated multivariate statistical language models.
Among the linguistic community, statistical methods or more generally quantitative
techniques are mostly ignored or avoided because of lack of training, and also
fear and dislike. The reasons: (1) these techniques are just not related to linguistics,
philology or humanities; statistics falls into the province of sciences, mathematics
and the like; and/or (2) there is a feeling that these methods may destroy the "magic"
in literary text.
There currently exist quite a few introductory statistics texts for linguists. Some of
these books are either (1) statistically and mathematically too demanding, not really
intended for linguists, (2) restricted to applied linguistics and with little statistical
coverage, or (3) not directly related to linguistics but more social science oriented.
The aim of this book is to try to illustrate with numerous examples how quantitative
methods can most fruitfully contribute to linguistic analysis and research. In
addition, we do not intend here to offer an exhaustive presentation of all statistical
techniques available to linguistics, but to demonstrate the contribution that statistics
can and should make to linguistic studies.
We have tried to keep the writing simple and have sought to avoid too much
statistical jargon whenever possible in an attempt to be user-friendly. This explains
also why we have deliberately avoided using references within the text; these are
normally given in footnotes, although an extensive reference section is given at the
end of the book for the interested reader.
This book presents an accessible introduction to using statistics in language and
linguistic research. It describes the most popular statistical techniques, explaining
the basic principles and demonstrating their use in a wide range of linguistic
research: (1) observations and descriptions of some aspects of language phenomena
including the areas of applied linguistics, sociolinguistics, dialectology and so on,
as far as they can be modelled by means of quantitative mathematical methods,
(2) applications of methods, models or techniques from quantitative linguistics to
problems of natural language description and language teaching, and (3) methodological
problems of linguistic measurement, model construction, sampling and test
theory.
The range of techniques introduced by the book will help the reader both to
evaluate and make use of literature which employs statistical analysis and apply
statistics in their own research. Each chapter gives step-by-step explanations of
particular techniques using examples from a number of linguistic fields. None of
the techniques requires the reader to have a grasp of mathematics more complex
than simple algebra.
We have deliberately sought to write and design this book not for mathematicians
but for linguists, in an attempt to demonstrate the use of statistical techniques in a
wide range of linguistic research.
1 Some basic issues
1.1 Introduction
There are various means that can help us to represent summarized data: frequency
and contingency tables, charts or graphs (histograms or frequency polygons).
However, we may sometimes wish to compare data or sets of data, in order to look
for similarities or differences. Tables, graphs and charts may not always be useful
in this respect as it is not easy, on some occasions, to talk succinctly about a set
of numbers. Consider the two histograms in Figures 1.1 and 1.2, which show the
sentence length in words of two linguistic samples; one contains written texts and
the other is made up from written-to-be-spoken text samples, such as speeches,
conferences, and so on. In this case, a mere glance at them allows us to realize
immediately that the sentence length in written texts tends to be longer than that in
written-to-be-spoken texts.
However, the normal shape of things is that situations are rarely that neat and
clear. For example, Table 1.1 compares the data on sentence length (in words) of
written texts and written-to-be-spoken texts.
The figures relative to each variety are very similar but there are some slight
differences. With this simple data display, it is very hard, if not impossible, to make
any precise statement about these dissimilarities in the overall sentence length of
written texts and written-to-be-spoken texts.
Studies in our field often omit the use of graphs. However, graphs usually
just provide some form of description (descriptive statistics), that is, a numerical
representation of the data that might help us in creating a mental picture of how a
linguistic phenomenon performs. It is our responsibility not just to get a "picture"
of the situation but also to look at the information or data provided. Both aspects
are important.
[Histogram: x-axis, length (in words) in bands 1-5 to 31-35; y-axis scale 0 to 25]
Figure 1.1 Sentence length of written texts.
[Histogram: x-axis, length (in words) in bands 1-5 to 31-35; y-axis scale 0 to 35]
Figure 1.2 Sentence length of written-to-be-spoken texts.
Table 1.1 Written versus written-to-be-spoken texts.
                              1-5   6-10   11-15   16-20   21-25   26-30   31-35
Written texts                   3      6      17      22      17       4
Written-to-be-spoken texts     20     30      12       4       2
One way to look at the information is to represent the data for a group of items by
means of scores. That is, single measures that are the most typical scores for a data
set.
1.2 Measures of central tendency
The central tendency indicates the typical behaviour of a data set. There are different
ways or measures to estimate it. The measures of central tendency can provide
us with information on how a group or collection of data performed overall.
Therefore, central tendency is a representative or typical score. For example, if a
foreign language teacher is asked to provide a single value which best describes the
performance level of her group of students, she would answer with a measure of
central tendency. The three measures of central tendency that will be discussed are
the mode, the median and the mean.
1.2.1 Mode
The mode is the value or score that occurs most frequently in a given set of scores.
If we take the above data on sentence length of written and written-to-be-spoken
texts (Table 1.1) then the mode for written texts would be sixteen to twenty words,
because this sentence length occurs most often (22 times). It would
be six to ten words for written-to-be-spoken texts. To keep the mode clear in your
mind, associate the term with its near-synonym: fashionable. Thus, the mode is the
score or value that is most fashionable. This score has no statistical formula and
is straightforward. The mode is easily identifiable in frequency polygons, that is,
graphic displays of frequency tables, as it corresponds to the score with the highest
point. A distribution may have more than one mode if two or more scores occur
the same number of times. Such distributions are called bimodal (two modes),
trimodal (three modes), and so on.
This central tendency score is easy to use and quick to compute. However, it is not
very useful as it does not give much information about the distribution and it is easily
affected by chance scores, although for large data sets this is less likely to happen.
1.2.2 Median
In any group of scores the median is the middle point or central score of the distribution,
with half of the scores lying above and half falling below. It is that point below
which 50 per cent of the scores fall and above which 50 per cent fall, dividing any
set of scores into two equal subgroups; one of these contains all the scores lower
than the median and the other all those greater than the median. For example, in a
corpus consisting of sentences with the following number of words: 15, 16, 20, 22,
12, 12, 13, 15 and 14; the median is 15 (12, 12, 13, 14, 15, 15, 16, 20 and 22), as
four of the sentences are shorter (12, 12, 13 and 14) and four are longer or equal
(15, 16, 20 and 22). If there is an even number of scores as, for example, in this
distribution: 12, 12, 13, 14, 15, 16, 16, 17, 20, 22, then the median is the midpoint
between the two middle scores: 15.5, which is calculated by adding the two middle
scores and dividing by 2:
$$\frac{15 + 16}{2} = 15.5$$
1.2.3 Mean
The mean, also known as the average, is the sum of all scores divided by the total
number of scores. Application of it to our sentence length example yields the result
in Table 1.2.
That is, the mean sentence length for the data set in Table 1.2 is 15.7 words per
sentence. The mean is defined by the formula:
$$\bar{X} = \frac{\sum X}{N}$$
Where $\bar{X}$ = mean, $N$ = number of scores and $\sum X$ = sum of all scores.
Table 1.2 Calculation of the mean.
Sentence   Sentence length (in words)
1          12
2          12
3          13
4          14
5          15
6          16
7          16
8          17
9          20
10         22
Calculations
1. Sum of scores: $\sum X = 12 + 12 + 13 + 14 + 15 + 16 + 16 + 17 + 20 + 22 = 157$
2. Number of scores: $N = 10$
3. Mean: $\bar{X} = \frac{157}{10} = 15.7$
The mean is the preferred measure of central tendency, both as a description of
the data and as an estimate of the parameter. Its major drawback is that it is sensitive
to extreme scores. If the data are not normally distributed (see section 1.6.1 below),
for example if most of the items are grouped towards the higher end of the scale,
then the median may be a more reliable measure.
1.3 Proportions and percentages
Table 1.3 Modal verbs in arts and science texts.
Modal verb   Arts   Science
Can           265       778
Could         296       307
May           187       547
Might         157       113
Must          130       236
Ought to       12         6
Shall          18        44
Should         98       159
Will          174       593
Would         421       485
Table 1.4 Modal verbs in arts and science texts in percentages.
Modal verb   Arts             Science          Total
Can           265   15.07%     778   23.80%    1043    20.75%
Could         296   16.83%     307    9.39%     603    11.99%
May           187   10.63%     547   16.73%     734    14.60%
Might         157    8.93%     113    3.45%     270     5.37%
Must          130    7.39%     236    7.22%     366     7.28%
Ought to       12    0.68%       6    0.18%      18     0.35%
Shall          18    1.02%      44    1.34%      62     1.23%
Should         98    5.57%     159    4.86%     257     5.11%
Will          174    9.89%     593   18.14%     767    15.26%
Would         421   23.94%     485   14.84%     906    18.02%
Total        1758   34.97%    3268   65.03%    5026   100.00%
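The percentages in Table 1.4 can be recomputed from the raw counts in Table 1.3. The following is a minimal Python sketch offered as a cross-check, not as part of the original text; the names `counts` and `percentage` are illustrative assumptions. Because the sketch rounds to two decimals when printing, the last digit may occasionally differ from the printed table.

```python
# Raw counts from Table 1.3, stored as (arts, science) pairs.
counts = {
    "can": (265, 778),
    "could": (296, 307),
    "may": (187, 547),
    "might": (157, 113),
    "must": (130, 236),
    "ought to": (12, 6),
    "shall": (18, 44),
    "should": (98, 159),
    "will": (174, 593),
    "would": (421, 485),
}


def percentage(part, whole):
    """Return part as a percentage of whole (hypothetical helper)."""
    return 100 * part / whole


# Column totals and the grand total of all modal verbs.
arts_total = sum(arts for arts, _ in counts.values())
science_total = sum(science for _, science in counts.values())
grand_total = arts_total + science_total

# Share of each modal verb within its own domain and across both domains.
arts_share = {v: percentage(a, arts_total) for v, (a, _) in counts.items()}
science_share = {v: percentage(s, science_total) for v, (_, s) in counts.items()}
total_share = {v: percentage(a + s, grand_total) for v, (a, s) in counts.items()}

for verb in counts:
    print(f"{verb:<9} {arts_share[verb]:6.2f}% "
          f"{science_share[verb]:6.2f}% {total_share[verb]:6.2f}%")
```

Keeping the raw counts alongside the derived percentages, as the sketch does, means another researcher can always redo the calculation.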
On many occasions data are presented as proportions or percentages. Suppose we
are interested in the distribution of a linguistic variable in relation to some other
variable: for example, the use of modal verbs (linguistic variable) in relation to arts
texts versus science texts. To simplify matters we shall concentrate on the distribution
of just ten modal verbs (can, could, may, might, must, ought to, shall, should,
will and would). Table 1.3 shows the number of times each modal verb occurs in
each linguistic domain (arts and science).
However, we could also represent the data as fractions of the total number of
modal verbs, in percentages (see Table 1.4). This table displays the partial percentages
and total percentages. If we take the second arts column, we see that can makes
up 15.07 per cent of all modal verbs used in arts texts. Similarly, in the second totals
column we find that, for example, should makes up 5.11 per cent of all modal verbs
used in arts and science. Comparing arts and science, scientific communication uses
many more modals than arts does: 65.03 per cent versus 34.97 per cent.
Percentages and proportions are particularly useful to summarize data. However,
we should also note that simple averages of percentages and proportions are not
always appropriate, particularly whenever the data are presented only as percentages
and proportions without the original values being given. This is a serious error that
often prevents analysis of the data by another researcher. In addition, it may be difficult
to interpret a mean proportion or percentage when it is not clear what the total
number of observations was over which the original proportion was measured, especially
when the original values which the percentages were based on are omitted.
1.4 Dispersion
The mean and the median can be very useful for the comparison of data sets as they
help us to interpret the typical behaviour of the data in the form of central tendency.
If these data sets are very different, then we are likely to have made a significant
discovery. However, if the two means or medians are very similar, then it is difficult
to make any statement on the complete set of scores. Let us consider the artificial
example in Table 1.5, showing the distributions of nouns and verbs in arts and science texts
over five samples of equal length:
Table 1.5 Nouns and verbs in arts and science texts.
         Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
Nouns          20         29         24         15         22
Verbs          25         23         18         25         19
Comparing the mean of nouns and verbs, we get 22 for both cases, and two very
similar medians: 22 (nouns) and 23 (verbs). Both the mean and the median reveal
that both linguistic items are very similar in shape or structure. A closer look at
the data, however, reveals that both samples have a dissimilar variability, that is, a
different spread or dispersion of scores. Thus, for nouns, the highest score is 29 and
the lowest 15, and data dispersion range is 15 (29 - 15 + 1), whereas for verbs, we
get a dispersion of 8 (25 - 18 + 1). These two measures, 15 and 8, provide the range
of nouns and verbs respectively, giving us an idea of how the various samples vary
from the central tendency.
1.4.1 Range
The range is defined as the number of points between the highest score and the lowest
one plus one (plus one in order to include the scores of both ends). The range is
a quick measure of variability, as it provides information on how data scores vary
from the central tendency.
Occasionally, it may give a distorted picture of the data as it just represents the
extreme scores of the variation and, as a result, it is strongly affected by behaviour
that may not necessarily be representative of the data set as a whole. For instance,
in Table 1.5, if there were a text sample with just 5 nouns, then the noun range
would change dramatically from 15 to 25 (29 - 5 + 1). This range would not really
represent the behaviour of noun occurrence, as it was strongly affected by a single
score (5). The range should be taken cautiously just as a dispersion indicator and
should be interpreted simply as the number of points between the highest and the
lowest scores, including both.
1.4.2 Variance and standard deviation
Recall our sentence length data above (section 1.2.3). As we know its mean length
(15.7), we can calculate the difference between each sentence length value and the
mean length:
$$\text{difference}_i = X_i - \bar{X}$$
Intuitively, this difference measure might give us some kind of information on
how far apart the individual sentence length values cluster around the mean length.
Table 1.6 shows all the calculations. A close look at the figures reveals that not
all differences are positive, some are negative, indicating that the sentence lengths
of these samples are less than the mean. Furthermore, if we add all the difference
values, we come up with zero. This happens not just here but will occur with any
data set, if the calculations are performed correctly.
Table 1.6 Differences with the mean.
Sentence length (in words)   Differences (diff) with mean
12                           -3.7
12                           -3.7
13                           -2.7
14                           -1.7
15                           -0.7
16                            0.3
16                            0.3
17                            1.3
20                            4.3
22                            6.3
N = 10                       $\sum \text{diff} = 0$
One way to avoid getting always the same calculation (zero) is squaring the differences,
as we shall, then, get only positive figures (see Table 1.7).
Once we have the sum of the squared differences ($\sum \text{diff}^2 = 98.1$), we can calculate the
variance straightforwardly:
$$V = \frac{\sum \text{diff}^2}{N} = \frac{98.1}{10} = 9.81$$
The variance is thus the arithmetic mean of the squared differences or square
deviations. However, dividing by N (number of cases) is only correct whenever N
is very large. In our case, with only ten scores, it would be wise to divide the square
deviations by N - 1, instead.¹ This would result in:
$$V = \frac{\sum \text{diff}^2}{N - 1} = \frac{98.1}{9} = 10.9$$
1. Generally, the denominator of N is used for a population and N - 1 for a sample.
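The variance calculation above can be retraced step by step. The following is a minimal Python sketch, not part of the original text, using the ten sentence lengths from section 1.2.3; the variable names are illustrative assumptions. It also compares the hand calculation with `pvariance` and `variance` from Python's standard `statistics` module, which divide by N and N - 1 respectively.

```python
import statistics

# Sentence lengths (in words) from the worked example in section 1.2.3.
lengths = [12, 12, 13, 14, 15, 16, 16, 17, 20, 22]

n = len(lengths)
mean_length = sum(lengths) / n                   # 157 / 10

# Difference between each score and the mean; these sum to zero
# (up to floating-point rounding), as noted in the text.
deviations = [x - mean_length for x in lengths]

# Squaring removes the signs, so the sum no longer cancels out.
squared_sum = sum(d ** 2 for d in deviations)

population_variance = squared_sum / n            # divide by N
sample_variance = squared_sum / (n - 1)          # divide by N - 1

# Cross-check against the standard library implementations.
library_population = statistics.pvariance(lengths)
library_sample = statistics.variance(lengths)

print(f"mean = {mean_length:.1f}")
print(f"sum of squared differences = {squared_sum:.1f}")
print(f"variance (N) = {population_variance:.2f}")
print(f"variance (N - 1) = {sample_variance:.1f}")
```

With only ten scores the two denominators give noticeably different values, which is why the choice between N and N - 1 matters for small samples.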