Long-range fractal correlations in literary corpora

Marcelo A. Montemurro* and Pedro A. Pury†
Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba,
Ciudad Universitaria, 5000 Córdoba, Argentina
(February 1, 2008)

In this paper we analyse the fractal structure of long human-language records by mapping large samples of texts onto time series. The particular mapping set up in this work is inspired by linguistic considerations, in the sense that it retains the word as the fundamental unit of communication. The results confirm that, beyond the short-range correlations resulting from syntactic rules acting at sentence level, long-range structures emerge in large written language samples that give rise to long-range correlations in the use of words.

I. INTRODUCTION

The human language faculty reveals itself as a phenomenon of remarkable complexity, intimately interwoven with all our superior mental functions [1]. Its role in the modelling and categorisation of factual experience by the mind cannot be downplayed. Moreover, it has been suggested that the outburst of symbolic and abstract imagery pervading the archaeological record, signalling the transition between the Middle and Upper Paleolithic periods, might be interpreted as the hallmark of an associated, and rather sudden, transition in the linguistic performance of modern humans [2,3]. The evolutionary advantage of complex syntactic communication stems from the capacity that it confers to language of coding complex information using moderate symbolic resources [4]. In the traditional areas of linguistics much success has been attained in dissecting language structure from sentence level down to the morphological rules of word formation. However, effective linguistic communication is a complex cognitive process [5] in which sequences of sentences cohere into larger meaningful structures. At this point the study of language becomes susceptible of being complemented by techniques that have been successfully applied to other complex systems in nature and that allow a variety of interdisciplinary approaches.

In particular, one question that still remains open to discussion is the ultimate origin of long-range correlations in complex information systems like the genetic code or human language. The main goal of this work is to examine the existence of long-range correlations in the use of words in written language, drawing on methods from statistical time series analysis. To that end, we first describe the particular mapping scheme used to translate the sequence of words present in a text sample into a time series. Then, we devote a section to explaining the rescaled range analysis and how it can be used to infer the presence of long-range correlations in time series. Finally, the results of our systematic analysis are presented, followed by concluding remarks.

II. THE MAPPING OF TEXTS ONTO TIME SERIES

The statistical analysis of long symbolic structures underwent an accelerated development right after the first complete DNA sequences started to become available, around a decade ago [6]. Basically, the original techniques tried to create a random walk out of the particular symbolic sequence by mapping each distinct symbol onto a step in an independent direction in space. Thus, in a genetic sequence, made up of the concatenation of four independent nucleotides represented by four letters, each instance of these symbols was translated into an elementary jump along independent directions in a four-dimensional space. Although different alternatives for the particular details of the mappings were put forward, all were devised to keep the original sequence's structure at symbol level translated into the time series. Then, by applying a set of standard statistical techniques, such as power spectrum analysis, detrended fluctuation analysis or rescaled range analysis among others, it was possible to unravel the existence of long-range correlations in biosequences [6,7].
Written samples of natural languages are also complex information carriers meant to be read sequentially. Thus, techniques similar to those employed in genetic sequence analysis were adapted with appropriate modifications [8–10]. The different procedures shared one common point, namely, that the mappings were invariably performed at letter level. However, it has never been discussed in the literature so far whether the mapping at letter level is indeed adequate in the case of human language. It is crucial to note that the specific coding of words in some particular spelling or phonetic system is alien to the linguistic structure of communication, as is clear from the fact that a given text can be readily coded in sign languages, for instance. Therefore, a better mapping, founded on a linguistic basis, should recognise the symbols at the minimum level intrinsic to the communication process. In what follows we propose the simplest step in that direction, which consists in taking word tokens as the fundamental units of communication. By doing this we can assert that the mapped sequences do not carry any structure below word level.

In order to make clear the details of the actual translation of a given text into a time series, a brief detour through Zipf's analysis is required [11]. It basically consists in counting the number of occurrences of each distinct word in a given text sample, and then producing a list of all these words ordered by decreasing frequency. Each word in this list can be identified by an index equal to its rank, r, in the list; that is, the most frequent word has index r = 1, the second in the list is given r = 2, and so on. Words that occur the same number of times are ordered arbitrarily within their corresponding rank interval.

By means of Zipf's analysis each word in the original text can be replaced by its corresponding index r. Then, at position t, counted from the beginning of the text, we have the corresponding index r(t). According to this mapping, the whole text becomes a time series of ranks, namely {r(t)}_{t=1}^{T}, where T stands for the total length of the text. This particular numerical assignment may seem arbitrary at first, and in fact a different choice would certainly work as well. However, the selection of the rank as an indexing key to the words may be rendered more natural if we think of it as the assignment that minimises the effort of lexical access in the rank-ordered list of words while writing the whole text. When a word is required at position t in the text, it must be picked from position r(t) in the list of words ordered by decreasing frequency.

In order to compare different time series generated by this mapping, it is useful to rescale the series to zero mean and unit variance. Thence, we define the standard deviation of the rank sample values as σ̂ = \sqrt{⟨r(t)²⟩ − ⟨r(t)⟩²}, where the symbol ⟨...⟩ denotes an arithmetic average over the whole series of ranks. Then, the quantity ξ(t) is defined as

    \xi(t) = \frac{r(t) - \langle r(t) \rangle}{\hat{\sigma}} .                    (2.1)

In this way, the sequence of normalised increments ξ(t) can now be regarded as the sequence of elementary jumps in a random walk, each taken at a discrete time t.
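To make the mapping concrete, the following minimal Python sketch builds the rank series and the normalised increments ξ(t) of Eq. (2.1) from a list of word tokens. The function name and the whitespace tokenisation of the toy example are our own illustrative choices; the text does not specify how punctuation or capitalisation were handled.

from collections import Counter

def words_to_normalised_ranks(words):
    """Zipf step plus normalisation: map a token sequence onto the series
    xi(t) of Eq. (2.1), with zero mean and unit variance."""
    counts = Counter(words)
    # rank 1 for the most frequent word; ties broken by the sort, i.e. arbitrarily
    ordered = sorted(counts, key=counts.get, reverse=True)
    rank = {w: i for i, w in enumerate(ordered, start=1)}

    r = [rank[w] for w in words]                    # time series of ranks r(t)
    T = len(r)
    mean = sum(r) / T
    sigma = (sum(x * x for x in r) / T - mean ** 2) ** 0.5
    return [(x - mean) / sigma for x in r]          # normalised increments xi(t)

# toy example; the corpora analysed in this paper contain ~10^6 word tokens
xi = words_to_normalised_ranks("to be or not to be that is the question".split())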
III. RESCALED RANGE ANALYSIS

Hydrology is the oldest discipline in which the presence of noncyclic very long-term dependence has been reported. In particular, the rescaled range analysis was introduced by Harold E. Hurst [12] when he was studying the Nile in order to describe the long-term dependence of water levels in the river and its reservoirs. Later, this statistical technique was further developed and applied by Mandelbrot and Wallis [13–15].

Let ξ(t), 1 ≤ t ≤ T, be the normalised sequence of increments of a process in discrete time. In our case, ξ(t) is the normalised sequence of ranks from a text corpus of T words, as described in the previous section. From this sequence, the record

    X(t) = \sum_{u=1}^{t} \xi(u)                                                   (3.1)

is constructed. Viewing the ξ(t) as spatial increments in a one-dimensional discrete random walk, X(t) is the position of the walker, measured from the starting point, at time t. For any given integer span s > 1 and any initial time t, a detrended subrecord D(u,t,s), for 0 ≤ u ≤ s, can be defined as

    D(u,t,s) = X(t+u) - X(t) - \frac{u}{s}\left[ X(t+s) - X(t) \right] .           (3.2)

In this quantity, the mean \sum_{w=1}^{s} \xi(t+w)/s was subtracted in order to remove the trend of the subrecord. The cumulated range R(t,s) of the subrecord is defined by

    R(t,s) = \max_{0 \le u \le s} D(u,t,s) - \min_{0 \le u \le s} D(u,t,s)         (3.3)

and the variance S²(t,s) of the subrecord is defined by

    S^2(t,s) = \frac{1}{s} \sum_{w=1}^{s} \xi^2(t+w)
               - \left[ \frac{1}{s} \sum_{w=1}^{s} \xi(t+w) \right]^2 .            (3.4)
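As an illustration of Eqs. (3.1)–(3.4), the following sketch computes the ratio R(t,s)/S(t,s) for a single subrecord. It is a minimal NumPy implementation written for this presentation (with 0-based indices), not the code actually used for the results reported below.

import numpy as np

def rescaled_range(xi, t, s):
    """R(t,s)/S(t,s) for the subrecord of span s starting at position t,
    following Eqs. (3.1)-(3.4); xi is a 1-D array of normalised increments
    and t is a 0-based offset."""
    window = np.asarray(xi[t:t + s], dtype=float)      # xi(t+1), ..., xi(t+s)
    X = np.concatenate(([0.0], np.cumsum(window)))     # X(t+u) - X(t) for u = 0, ..., s
    u = np.arange(s + 1)
    D = X - (u / s) * X[-1]                            # detrended subrecord, Eq. (3.2)
    R = D.max() - D.min()                              # cumulated range, Eq. (3.3)
    S = np.sqrt(np.mean(window**2) - np.mean(window)**2)   # Eq. (3.4)
    return R / S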
For many time series of natural phenomena the average of the sample values of R(t,s)/S(t,s), carried over all admissible starting points t within the sample, follows Hurst's law: E[R(t,s)/S(t,s)] ∼ s^H with H > 1/2. Hurst's observation is remarkable considering that, in the absence of long-run statistical dependence, one should find H = 1/2 for processes with finite variance. For example, for a stationary Gaussian process ξ(t) with ⟨ξ(t)⟩ = 0 and ⟨ξ²(t)⟩ = 1, Feller [16] proved analytically that

    \lim_{s \to \infty} s^{-1/2} E[R(t,s)/S(t,s)] = \sqrt{\frac{\pi}{2}} .         (3.5)

Additionally, Mandelbrot and Wallis [15] showed that the s^{1/2} law also applies to processes of independent increments having a variety of distributions: truncated Gaussian, hyperbolic, and (highly skewed) log-normal. Moreover, they also showed that when the increments of the process are statistically dependent but the dependence is limited to the short run, the s^{1/2} law holds asymptotically. The effect of strong cyclic components was also studied. When a white Gaussian noise (of zero mean and unit variance) is superimposed on a purely periodic process of amplitude A, it can be seen that

    \lim_{s \to \infty} s^{-1/2} E[R(t,s)/S(t,s)]
        = \sqrt{\frac{\pi}{2}} \left( 1 + \frac{A^2}{2} \right)^{-1/2} .           (3.6)

The value of the limit, as well as the speed with which it is attained, depends on A. As A increases, E[R(t,s)/S(t,s)] takes an ever longer time to reach the s^{1/2} law. Moreover, the transition to the asymptote is not monotonic, but typically exhibits a series of oscillations.

The above comments support the key idea that the s^{1/2} law holds for every process in which long-term dependence is unquestionably absent, and does not hold for processes exhibiting noncyclic long-term statistical dependence. Thus, a Hurst exponent of H = 1/2 corresponds to the vanishing of correlations between past and future spatial increments in the record. For H > 1/2 one has persistent behaviour, which means that a positive increment in the past will on average lead to a positive increment in the future. Conversely, a decreasing trend in the past implies on average a sustained decrease in the future. Correspondingly, the case H < 1/2 denotes antipersistent behaviour.

In all the above discussion the focus was on the scaling properties of the ratio R(t,s)/S(t,s). It is important to note that the introduction of the denominator S(t,s) becomes ever more relevant for processes that deviate from the Gaussian and/or have long-term dependence [15]. Furthermore, the ratio R(t,s)/S(t,s) has better sampling stability, in the sense that the relative deviation, defined by \sqrt{\mathrm{Var}[R(t,s)/S(t,s)]}/E[R(t,s)/S(t,s)], is smaller than for any alternative expression used to study long-term dependence.

IV. ESTIMATION OF H IN LITERARY CORPORA

In this section we shall explain the implementation of the rescaled range analysis, and present the results obtained after performing the experiments on texts coded as normalised sequences of ranks.

As stated in the previous section, the method is based on the estimation of sample averages of the ratio R(t,s)/S(t,s) for a proper choice of different values of the integer span s. We shall follow the nomenclature and prescriptions introduced by Mandelbrot and Wallis [14] in the construction of the R/S diagrams from the experimental data. We first compute the exponent m such that 2^m ≤ T < 2^{m+1}, and then we select the value of the span s from the decreasing sequence of integers {T/2^p, p = 0, 1, ..., m−2}. For each s, we choose the starting points t = qs + 1, q = 0, 1, ..., 2^p − 1, in order to construct 2^p nonoverlapping detrended subrecords D(u,t,s) from Eq. (3.2). Thus, for a given value of the index p, we have specified a value of log s, which is marked on the axis of abscissas. In this way, we can reckon 2^p values of log[R(t,s)/S(t,s)] corresponding to the different starting points t, which are plotted as ordinates (marked by + signs). For each s, the logarithm of the sample average of the quantities R(t,s)/S(t,s) over the different starting points is also marked in the R/S diagram. Then, the value of H is estimated by linear regression of log E[R(t,s)/S(t,s)] vs. log s. The slope is calculated by the linear least-squares method, and its error is evaluated by taking the uncertainty in the ordinate associated with each value of the abscissa (log s) equal to one quarter of the amplitude between the corresponding extreme points.

There are two pitfalls related to the R/S diagrams. First, for small values of log s (short subrecords) there is a large scatter in the values of log[R(t,s)/S(t,s)]. Second, for large values of log s we have very few nonoverlapping subrecords, thus narrowing the R/S diagram. This tightening leads to a deceptive evaluation of the uncertainty in the ordinate. Hence, when fitting a straight trendline, small and large values of log s should be neglected. Given a value of p, we have 2^p subsamples, each of which has T/2^p ≥ 2^{m−p} ranks. Therefore, we retain the values of the span s given by 4 ≤ p ≤ m−4 as our criterion for fitting. In this manner, we only consider averages over 16 or more samples, and each subrecord has at least 16 ranks.
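A compact sketch of this estimation procedure, built on the rescaled_range helper shown earlier, could look as follows. The quarter-amplitude error estimate described above is omitted, the starting points are 0-based (t = qs instead of qs + 1), and the unweighted least-squares fit and the names are our own simplifications rather than the authors' actual implementation.

import numpy as np

def estimate_hurst(xi, p_min=4):
    """Estimate H by linear regression of log E[R/S] vs. log s, following the
    span and starting-point prescription described in the text."""
    T = len(xi)
    m = int(np.floor(np.log2(T)))              # 2**m <= T < 2**(m+1)
    log_s, log_rs = [], []
    for p in range(p_min, m - p_min + 1):      # fitting criterion 4 <= p <= m - 4
        s = T // 2**p
        # 2**p nonoverlapping subrecords of span s
        samples = [rescaled_range(xi, q * s, s) for q in range(2**p)]
        log_s.append(np.log10(s))
        log_rs.append(np.log10(np.mean(samples)))
    H, _ = np.polyfit(log_s, log_rs, 1)        # the slope is the Hurst exponent
    return H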
We shall follow the nomenclature the sequence after deleting all ranks outside the interval and prescriptions introduced by Mandelbrot and Wal- (100,2000). Strikingly, the corresponding value of H is lis [14]in the constructionof the R/S diagramsfromthe statisticallyindistinguishablefromthatoftheoriginalse- experimental data. We first compute the exponent m quence. Thisfacttellsusthatthecoreoflong-rangecor- such that 2m ≤ T < 2m+1, and then we select the value relationis neither supportedby the mostfrequent words of the span s from the decreasing sequence of integers: nor the least used. {T/2p, p = 0,1,...,m−2}. For each s, we choose the There are two additional experiments which can pro- starting points: t = qs+1, q = 0,1,...,2p −1 in or- vide information in tracking down the source of the cor- dertoconstruct2p nonoverlappingdetrendedsubrecords relatedbehaviour. Thefirstoneisasimplerandomshuf- D(u,t,s) from the Eq. (3.2). Thus, for a given value fling of all the ranks in the sequence which has the ef- of the index p, we have specified a value of logs which fect of recasting Shakespeare’s plays into a nonsensical is marked on the axis of abscissas. In this way, we can 3 realisation,keeping the same originalwordswithout dis- At this point it may be illustrative to compare a plot cernible order at any level. As is clear, after this experi- of the record X(t), the position of the walker as a func- menttheanalysismustyieldaHurstexponentindicating tion of time, both for the Shakespeare’s plays and for a the uncorrelated character of the sequence. This is cor- stochastic sequence generated with H = 0.7 by the suc- roboratedalsoinFig.1wherethesampleaverageisplot- cessive random addition method [17]. As can be seen in tedassmalltriangles. Yetmoreinterestingistheanalysis Figs. 2 and 3 both records show strong persistence that of a shuffling of Shakespeare’s plays that preserves sen- manifests itself in the long spans of averagemonotonous tencestructure,andthereforeEnglishgrammar. Thatis, behaviour in the records. by defining a sentence as the sequence of words between two periods, we can reorder them in a random fashion andthusproduceagrammaticallycorrect,thoughhardly meaningful, version of the corpus. The sample averages are plotted in this case as small diamonds. The result- ing value of H shows that grammar is not sufficient to induce long-range correlations as we can see again from Table I. Let us note that for maintaining the readability of the graphwe only included the extreme R/S points of the original sequence. FIG.2. RecordofthecodedsequencefromtheShakespeare corpus. FIG. 1. R/S diagram corresponding to the Shakespeare corpus. The linear fits are showed only for the original se- quenceand the shufflingexperiments. source original sequence truncation f sentences shuffled ranks shuffled Shakespeare a 0.687±0.040 0.658±0.036 0.574±0.035 0.524±0.020 Dickensb 0.738±0.033 0.660±0.034 0.573±0.025 0.520±0.021 Darwin c 0.745±0.045 0.678±0.043 0.576±0.033 Simon’s model d 0.550±0.040 0.519±0.032 g Markovian text e 0.533±0.028 TABLEI. Values of H from theestimation by linear regression of logE[R/S]vs. logs. 
At this point it may be illustrative to compare plots of the record X(t), the position of the walker as a function of time, for Shakespeare's plays and for a stochastic sequence generated with H = 0.7 by the successive random addition method [17]. As can be seen in Figs. 2 and 3, both records show strong persistence, which manifests itself in the long spans of, on average, monotonous behaviour in the records.

FIG. 2. Record of the coded sequence from the Shakespeare corpus.

FIG. 3. Record of a fractional Brownian motion generated with H = 0.7.
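For reference, a minimal version of the successive random addition construction used for the synthetic record of Fig. 3 is sketched below. It follows the textbook refinement-plus-addition scheme; the exact variance schedule and parameters used in Ref. [17] may differ, so this is only an illustrative stand-in.

import numpy as np

def fbm_successive_random_additions(n_levels, H, seed=0):
    """Approximate a fractional Brownian record with Hurst exponent H by
    successive random additions: at each level the grid is refined by linear
    interpolation of midpoints, and Gaussian displacements with standard
    deviation (1/2)**(level*H) are added to every point.
    Returns 2**n_levels + 1 samples."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)                                # the two end points of the record
    for level in range(1, n_levels + 1):
        refined = np.empty(2 * len(x) - 1)
        refined[0::2] = x                          # keep the existing points
        refined[1::2] = 0.5 * (x[:-1] + x[1:])     # interpolate the midpoints
        refined += rng.normal(0.0, 0.5**(level * H), size=refined.size)
        x = refined
    return x

record = fbm_successive_random_additions(n_levels=16, H=0.7)   # ~65000 points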
TABLE I. Values of H from the estimation by linear regression of log E[R/S] vs. log s.

source             original sequence   truncation^f    sentences shuffled   ranks shuffled
Shakespeare^a      0.687±0.040         0.658±0.036     0.574±0.035          0.524±0.020
Dickens^b          0.738±0.033         0.660±0.034     0.573±0.025          0.520±0.021
Darwin^c           0.745±0.045         0.678±0.043     0.576±0.033
Simon's model^d    0.550±0.040         0.519±0.032^g
Markovian text^e   0.533±0.028

^a 36 plays: 885534 words.
^b 56 books: 5616403 words.
^c 11 books: 1508483 words.
^d 5×10^6 words generated after a transient and deleting the ranks ≤ 5.
^e 1.2×10^6 words generated from a table of frequencies corresponding to the Shakespeare corpus, with a memory of 7 letters.
^f Unless otherwise specified, all ranks outside the interval 100 < rank < 2000 were deleted from the coded sequence.
^g In this case the ranks outside the interval 5 < rank < 10000 were deleted.

In Figs. 4 and 5 we reproduce an identical analysis for two text corpora gathering collections of works by Dickens and Darwin, respectively; the results are summarised in Table I. The Dickens sequence was obtained from 56 books by the author and is 5616403 ranks (words) long (44700 different ranks), whereas the Darwin sequence was obtained from 11 books and has a length of 1508483 ranks (30120 words in the vocabulary). It is worth noticing that the values of H for the original sequences are indistinguishable for these two corpora, written in prose and in different styles, although they are slightly greater than the value obtained for the Shakespeare sequence. However, the three values of H corresponding to the truncation (100 < rank < 2000) are statistically equivalent. This fact suggests that the long-range correlations associated with the words in the interval considered are a robust phenomenon across different styles and authors. Finally, the shuffling experiments show the same behaviour for the three authors, and are consistent with uncorrelated sequences.

FIG. 4. R/S diagram corresponding to the Dickens corpus. The linear fits are shown only for the original sequence and the shuffling experiments.

FIG. 5. R/S diagram corresponding to the Darwin corpus. The linear fits are shown only for the original sequence and the shuffling at sentence level.

B. Tests of consistency

In this subsection we present three tests of the consistency of our analysis. First, we performed the same constructions as developed in the previous subsection on portions extracted from the original text corpora. Thus, we split the Shakespeare corpus into two parts of equal length and measured H for each part. On the other hand, from the Dickens corpus we extracted a succession of 861038 words starting from an arbitrary origin and then measured the corresponding value of H. In Table II the values from the original corpora and from their portions are confronted. Indistinguishable values of H result from the original corpora and their portions. In interpreting this result we should bear in mind that Zipf's analysis of a portion of a text generates a quite different table of ranks from the one corresponding to the entire corpus. This serves as an indication that we are quantifying a robust phenomenon inherent to the fractal structure of texts.

TABLE II. Tests of consistency: we estimated the value of H for portions of two of the corpora considered in Table I. The portions of the Shakespeare corpus correspond to the first and second halves of the original source. From the Dickens corpus we have taken an embedded succession of 861038 words starting from an arbitrary origin in the original corpus.

source     Shakespeare    Dickens
corpus     0.687±0.040    0.738±0.033
portion    0.675±0.044    0.715±0.043
           0.672±0.041

In the insets of Figs. 1 and 4 we present the values of the mean and the standard deviation (σ̂) corresponding to the set of ranks in the entire Shakespeare and Dickens corpora, respectively. For the Shakespeare corpus we can read that the mean of the ranks is equal to 1003, whereas σ̂ = 2632. On the other hand, Fig. 4 shows for the Dickens corpus that the mean of the ranks is equal to 1327 and σ̂ = 3793. These quantities are calculated from sample values of a probability distribution P(r), for which Zipf's law represents only a crude approximation. However, by using more accurate analytical descriptions of the probability density P(r), it is possible to evaluate the statistical moments of the whole distribution, that is, in the limit of infinite vocabulary. In particular, let us mention that the standard deviations calculated for both the Shakespeare and Dickens corpora, using the analytical expressions for P(r) obtained in Ref. [18], remain finite in the infinite-vocabulary limit. The reason for this is that, for large values of r, the probability density develops a fast exponential decay in the case of single-author corpora [18].

In Table I we also report the values of H corresponding to corpora generated by means of stochastic processes. In particular, we performed the analysis explained above on a sequence of ranks generated by Simon's model [19] and on another corresponding to a Markovian text [20] with a memory of seven letters. This short-range memory is enough to string together groups of a few words in grammatical order. However, as we can see from the values in the table, we do not obtain correlations from either type of sequence, which is consistent with the expected behaviour of processes with short-range correlations. Consequently, our implementation of the rescaled range analysis allows us to clearly distinguish between real texts and their stochastic counterparts [21].
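For concreteness, a rank sequence of the kind attributed here to Simon's model can be generated with the usual formulation of that process, sketched below. The value of the innovation probability alpha used for the sequence in Table I is not given in the text, so the one shown is purely illustrative, and the transient and rank-deletion steps mentioned in the table footnotes are omitted.

import random

def simon_model(n_words, alpha=0.06, seed=0):
    """Simon's stochastic model: with probability alpha a brand-new word is
    introduced; otherwise the next word is copied from a uniformly chosen
    earlier position, i.e. an existing word is reused with probability
    proportional to its current frequency."""
    rng = random.Random(seed)
    sequence = [0]                   # integer labels stand in for words
    next_label = 1
    while len(sequence) < n_words:
        if rng.random() < alpha:
            sequence.append(next_label)
            next_label += 1
        else:
            sequence.append(rng.choice(sequence))
    return sequence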
C. The Dictionary

So far we still do not have enough evidence to assert a precise source for the phenomenon analysed in this paper. We have characterised, with a robust quantifier, the presence of long-range correlations in literary texts beyond sentence level, and in fact over ranges spanning more than one individual work. The shuffling experiments attest to where the long-range correlation does not originate, but say nothing about where it does. In previous attempts to study this phenomenon, a variety of possible origins for the long-range dependence have been put forward, though the conclusions were in all cases inferred from the observed statistical behaviour of single literary works, and from mappings of texts at letter level. In the present work we use a robust mapping with a strong linguistic footing, and focus on the correlations that arise in large corpora comprising many individual books with no thematic linkage. Therefore, the existence of correlations overarching sets of entire works should be attributed to something other than either the relation between the ideas expressed by the author, as was proposed in Ref. [8], or nonuniformities in the distribution of word lengths and the associated densities of blank spaces, as was suggested in Ref. [10]. In particular, the latter alternative is ruled out from the outset, since our mapping does not carry information on word structure.

In order to gain more insight into the origin of the attested long-range correlations, we decided to perform another experiment with a special type of text. A dictionary is a large collection of entries, one for each different word, arranged in alphabetical order. Clearly, in general, sequences of thematic affinity span at most a few entries, for small groups of words associated by common roots. Nevertheless, as is confirmed in Fig. 6, where we display the R/S diagram generated from the Webster's abridged dictionary (1913 edition), the value of H certifies the presence of long-range correlations. Table III compiles the exponent H obtained from the entire dictionary and from the text without the abbreviations (which indicate the grammatical category of the entry words); the two values are fully consistent. This result is rather striking and hints at the presence of a type of long-range organisation in the layout of the lexical entries that may be connected with the phenomenon observed in literary works. This last assertion is further supported by performing a shuffle over the whole dictionary that keeps integrity at entry level. That is, we do not disrupt the structure of the entries, each made up of a word together with its associated definition, but we do mix them in a random manner. The resulting collection of definitions now lacks the alphabetical ordering; however, the explicit information contained in the dictionary is still intact. Notwithstanding, the long-range order is completely obliterated by the shuffle, as shown by the fall of the H value down to that of an uncorrelated sequence. This experiment corroborates that, beyond the local structures derived from the semantic affinity of small groups of related words, the ordered lexicon also possesses an overall macrostructure that emerges, as a hidden layer, out of the alphabetical order of the entries.

FIG. 6. R/S diagram of the Webster's abridged dictionary (1913 edition). The linear fits correspond to the sequences without the abbreviations.

TABLE III. Values of H for the Webster's abridged dictionary (1913 edition) after coding as a sequence of ranks.

Webster's Dictionary        H
original sequence           0.690±0.031
sequence without abbr.      0.699±0.036
entries shuffled^a          0.548±0.025

^a From the text without abbreviations.

A close examination of the structure of the dictionary reveals the source of the long-range order. Let us take for example the following sequence of related entries from the Webster's abridged dictionary (1913 edition): advice, advisability, advisable, advisableness, advisably, advise, advised, advising, advisedly, advisedness, advisement, adviser, advisership, advisor, and advisory. These entries form a small cohesive block of information, and all their definitions share many words. In turn, these shared terms point to possibly remote locations in the dictionary where, again, words with common roots form small clusters. This process carries on through many levels of depth, thus building a complex network of relations at word level, which may be closely related to the presence of long-range correlations in the succession of words in the dictionary. The entry-level shuffling destroys that emerging order, and thereby the correlations.
V. CONCLUSIONS

In this work we have addressed the important issue of the emergence of long-range correlations in human written communication. To that end we proposed a simple mapping of texts onto random walks that keeps the word as the basic unit of communication. Therefore, the time series generated by this mapping retain the structure relevant to the linguistic phenomenon being analysed.

We addressed in some detail the rescaled range analysis, which has been successfully applied to a vast variety of time series exhibiting long-term correlations. By applying this analysis to the coded texts we found conclusive evidence of long-range correlations in the use of words over spans as long as the whole corpora under study. It is worth noticing that the corpora used in this work are made up of the concatenation of independent literary works by individual authors, and therefore the long-range effects must emerge as a phenomenon independent of the particular bounds of single literary works. This observation led us to analyse the case of a dictionary as a special kind of text, which provided insight into possible sources of the long-range order. The alphabetical order of entries in a dictionary corresponds only to the first visible layer of structural organisation. Yet, by performing the shuffling at entry level we realised that the whole of the long-range order depends on a more complex structural layer related to the network of associations among word clusters.

As has been convincingly shown in Ref. [22], sets of literary works also possess higher-level structures associated with systematic patterns in word usage, which might also give rise to a complex network topology of relations among groups of words. The detailed mechanisms whereby these structures arise in language require further interdisciplinary research. In the light of the evidence supplied in this work, it is plausible that the ultimate source of these correlations is deeply related to the structural patterns of word distribution in written communication. In this view, groups of words associated by affine semantic hierarchies form a complex arrangement of cohesive clusters even over spans of entire corpora, thereby giving rise to long-range order in human written records.

ACKNOWLEDGMENTS

The authors are grateful to D. H. Zanette for his critical reading of the manuscript and to F. A. Tamarit for useful suggestions. This work has been partially supported by "Secretaría de Ciencia y Tecnología de la Universidad Nacional de Córdoba" (fellowship and grant 194/00). The digital texts analysed in this paper were obtained from Project Gutenberg Etext [23].

* Electronic mail: [email protected]
† Electronic mail: [email protected]

[1] S. Pinker, The Language Instinct (Harper Collins, New York, 2000).
[2] P. Mellars, in The Origin and Diversification of Language, eds. N. G. Jablonski and L. C. Aiello (California Academy of Sciences, San Francisco, 1998).
[3] I. Tattersall and J. H. Matternes, Sci. Am. 282, 38 (January 2000).
[4] M. A. Nowak, J. B. Plotkin, and V. A. A. Jansen, Nature 404, 495 (2000).
[5] A. Akmajian, R. A. Demers, A. K. Farmer, and R. M. Harnish, Linguistics: An Introduction to Language and Communication (MIT Press, Cambridge, 1992).
[6] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, Nature 356, 168 (1992); R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992); S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, F. Sciortino, and H. E. Stanley, ibid. 71, 1776 (1993); R. F. Voss, ibid. 71, 1777 (1993).
[7] R. F. Voss, Fractals 2, 1 (1994).
[8] A. Schenkel, J. Zhang, and Y.-C. Zhang, Fractals 1, 47 (1993).
[9] M. Amit, Y. Shmerler, E. Eisenberg, M. Abraham, and N. Shnerb, Fractals 2, 7 (1994).
[10] W. Ebeling and T. Pöschel, Europhys. Lett. 26, 241 (1994); W. Ebeling and A. Neiman, Physica A 215, 233 (1995).
[11] G. K. Zipf, Human Behavior and the Principle of Least Effort (Addison-Wesley, Reading, 1949).
[12] H. E. Hurst, Trans. Amer. Soc. Civil Eng. 116, 770 (1951).
[13] B. B. Mandelbrot and J. R. Wallis, Water Resources Research 5, 242 (1969).
[14] B. B. Mandelbrot and J. R. Wallis, Water Resources Research 5, 321 (1969).
[15] B. B. Mandelbrot and J. R. Wallis, Water Resources Research 5, 967 (1969).
[16] W. Feller, Ann. Math. Stat. 22, 427 (1951).
[17] R. F. Voss, "Random Fractal Forgeries", in Fundamental Algorithms in Computer Graphics, eds. R. Pynn and A. Skjeltorp (Plenum Press, New York, 1985), pp. 1–11.
[18] M. A. Montemurro, Physica A 300, 557 (2001).
[19] H. A. Simon, Biometrika 42, 425 (1955).
[20] B. Hayes, Sci. Am. 249, 16 (November 1983).
[21] A. Cohen, R. N. Mantegna, and S. Havlin, Fractals 5, 95 (1996).
[22] M. A. Montemurro and D. H. Zanette, Advances in Complex Systems, in press.
[23] http://promo.net/pg