Artificial sequences and complexity measures

Andrea Baronchelli(1,*), Emanuele Caglioti(2,†) and Vittorio Loreto(1,‡)

(1) "La Sapienza" University, Physics Department, P.le A. Moro 5, 00185 Rome, Italy, and INFM-SMC, Unità di Roma 1
(2) "La Sapienza" University, Mathematics Department, P.le A. Moro 5, 00185 Rome, Italy
(*) [email protected]  (†) [email protected]  (‡) [email protected]

arXiv:cond-mat/0403233v2 [cond-mat.stat-mech] 25 Jan 2006

In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistically motivated problems, and we present results for automatic language recognition, authorship attribution and self-consistent classification.

I. INTRODUCTION

One of the most challenging issues of recent years is presented by the overwhelming mass of available data. While this abundance of information and the extreme accessibility to it represent an important cultural advance, they raise on the other hand the problem of retrieving relevant information. Imagine entering the largest library in the world, seeking all relevant documents on your favorite topic. Without the help of an efficient librarian this would be a difficult, perhaps hopeless, task. The desired references would likely remain buried under tons of irrelevancies. Clearly the need for effective tools for information retrieval and analysis is becoming more urgent as the databases continue to grow.

First of all let us consider some among the possible sources of information. In nature many systems and phenomena are often represented in terms of sequences or strings of characters. In experimental investigations of physical processes, for instance, one typically has access to the system only through a measuring device which produces a time record of a certain observable, i.e. a sequence of data. On the other hand, other systems are intrinsically described by strings of characters, e.g. DNA and protein sequences, or language.

When analyzing a string of characters the main question is to extract the information it brings. For a DNA sequence this would correspond, for instance, to the identification of the subsequences codifying the genes and their specific functions. For a written text, on the other hand, one is interested in questions like recognizing the language in which the text is written, its author, or the subject treated.

One of the main approaches to these problems, the one we address in this paper, is that of information theory (IT) [1, 2] and in particular the theory of data compression.

In a recent letter [3] a method for context recognition and context classification of strings of characters or other equivalently coded information has been proposed. The remoteness between two sequences A and B was estimated by zipping a sequence A+B, obtained by appending the sequence B after the sequence A, and exploiting the features of data compression schemes like gzip (whose core is provided by the Lempel-Ziv 77 (LZ77) algorithm [4]). This idea was used for authorship attribution and, by defining a suitable distance between sequences, for languages phylogenesis.

The idea of appending two files and zipping the resulting file in order to measure the remoteness between them had been previously proposed by Loewenstern et al. [5] (using zdiff routines), who applied it to the analysis of DNA sequences, and by Khmelev [6], who applied the method to authorship attribution.
Similar methods have been proposed by Juola [7], Teahan [8] and Thaper [9].

In this paper we extend the analysis of [3] and we describe in detail the methods to define and measure the remoteness (or similarity) between pairs of sequences based on their relative information content. We devise in particular, without loss of generality with respect to the nature of the strings of characters, a method to measure this distance based on data-compression techniques.

The principal tool for the application of these methods is the LZ77 algorithm which, roughly speaking, achieves the compression of a file by exploiting the presence of repeated subsequences. We introduce (see also [10]) the notion of dictionary of a sequence, defined as the set of all the repeated substrings found by LZ77 in a sequential parsing of a file, and we refer to these substrings as the dictionary's words. Besides being of great intrinsic interest, every dictionary allows for the creation of Artificial Texts (AT), obtained by the concatenation of randomly extracted words. In this paper we discuss how comparing AT, instead of the original sequences, could represent a valuable and coherent tool for information extraction to be used in very different domains. We then propose a general AT comparison scheme (ATC) and show that it yields remarkable results in experiments.

We have chosen for our tests some textual corpora, and we have evaluated our method on the basis of the results obtained on some linguistically motivated problems. Is it possible to automatically recognize the language in which a given text is written? Is it possible to automatically guess the author and the subject of a given text? And finally, is it possible to define methods for the automatic classification of the texts of a given corpus?

The choice of the linguistic framework is justified by the fact that this is a field where anybody is able to judge, at least partially, the validity and the relevance of the results. Since we are introducing techniques for which a benchmark does not exist, it is important to check their validity on known and controlled examples. This does not mean that the range of applicability is reduced to linguistics. On the contrary, the ambition is to provide physicists with tools which could parallel other standard tools to analyze strings of characters.

In this perspective it is worthwhile recalling here some of the latest developments of sequence analysis in physics-related problems. A first field of activity [11, 12] is that of segmentation problems, i.e. cases in which a unique string must be partitioned into subsequences according to some criteria in order to identify discontinuities in its statistical properties. A classical example is that of the separation of coding and non-coding portions of DNA, but the analysis of genetic sequences in general represents a very rich source of segmentation problems (see, for instance, [10, 13, 14, 15]). A more recent area is represented by the use of data compression techniques to test specific properties of symbolic sequences. In [16], the technology behind adaptive dictionary data compression algorithms is used in a suitable way (which is very close to our approach) as an estimate of the reversibility of time series, as well as a statistical likelihood test. Another interesting field is related to the problem of the generation of random numbers. In [17] the importance of suitable measures of conditional entropies for checking the real level of randomness of random numbers is outlined, and an entropic approach is used to discuss some random number generator shortcomings (see also [18]). Finally, another area of interest is represented by the use of data compression techniques to estimate entropic quantities (e.g. Shannon entropy, Algorithmic Complexity, Kullback-Leibler divergence, etc.). Even though not new, this area is still topical [19, 20]. A specific application that has generated an interesting debate concerns the analysis of electroencephalograms of epilepsy patients [21, 22, 23]. In particular, in these papers it is argued that measures like the Kullback-Leibler divergence could be used to spot information in medical data. The debate is wide open.

The outline of the paper is as follows. In section II, after a short theoretical introduction, we recall how data compression techniques could be used to evaluate entropic quantities. In particular we recall the definition of the LZ77 [4] compression algorithm, and we address the problem of using it to evaluate quantities like the relative entropy between two generic sequences, as well as to define a suitable distance between them. In section III we introduce the concept of Artificial Text (AT) and present a method for information extraction based on Artificial Text comparison. Sections IV and V are devoted to the results obtained with our method in two different contexts: the recognition and extraction of linguistic features (sec. IV) and the self-consistent classification of large corpora (sec. V).
Finally, section VI is devoted to the conclusions and to a short discussion about possible perspectives.

II. COMPLEXITY MEASURES AND DATA COMPRESSION

Before entering into the details of our method, let us briefly recall the definition of the entropy of a string. Shannon's definition of information entropy is indeed a probabilistic concept referring to the source emitting strings of characters.

Consider a symbolic sequence (σ_1 σ_2 ...), where σ_t is the symbol emitted at time t and each σ_t can assume one of m different values. Assuming that the sequence is stationary, we introduce the N-block entropy:

H_N = -\sum_{\{W_N\}} p(W_N) \ln p(W_N)    (1)

where p(W_N) is the probability of the N-word W_N = (σ_t σ_{t+1} ... σ_{t+N-1}), and \ln = \log_e. The differential entropies

h_N = H_{N+1} - H_N    (2)

have a rather obvious meaning: h_N is the average information supplied by the (N+1)-th symbol, provided the N previous ones are known. Noting that the knowledge of a longer past history cannot increase the uncertainty on the next outcome, one has that h_N cannot increase with N, i.e. h_{N+1} ≤ h_N. With these definitions, the Shannon entropy for an ergodic stationary process is defined as:

h = \lim_{N \to \infty} h_N = \lim_{N \to \infty} H_N / N.    (3)

It is easy to see that for a k-th order Markov process (i.e. such that the conditional probability of a given symbol only depends on the last k symbols, p(σ_t | σ_{t-1} σ_{t-2}, ...) = p(σ_t | σ_{t-1} σ_{t-2}, ..., σ_{t-k})), one has h_N = h for N ≥ k.
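As a concrete illustration (our own sketch, not code from the paper), the block entropies of Eqs. (1)-(3) can be estimated from the empirical N-word frequencies of a finite sample; note that such estimates become increasingly biased as N grows, since long words are undersampled:

```python
import math
from collections import Counter

def H(seq, N):
    """Empirical N-block entropy of Eq. (1), in nats."""
    counts = Counter(seq[i:i + N] for i in range(len(seq) - N + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def h(seq, N):
    """Differential entropy h_N = H_{N+1} - H_N of Eq. (2)."""
    return H(seq, N + 1) - H(seq, N)

# toy example: h_N decreases with N, approaching the entropy h of Eq. (3)
sample = "the cat sat on the mat and the cat sat on the hat " * 20
for N in range(1, 5):
    print(N, round(h(sample, N), 3))
```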
The Shannon entropy h measures the average amount of information per symbol, and it is an estimate of the "surprise" the source emitting the sequence reserves to us. It is remarkable that, under rather natural assumptions, the entropy H_N, apart from a multiplicative factor, is the unique quantity which characterizes the "surprise" of the N-words [24]. Let us try to explain in which sense entropy can be considered as a measure of surprise. Suppose that the surprise one feels upon learning that an event E has occurred depends only on the probability of E. If the event occurs with probability 1 (sure), our surprise in its occurrence will be zero. On the other hand, if the probability of occurrence of the event E is quite small, our surprise will be proportionally large. For a single event occurring with probability p, the surprise is proportional to \ln p. Let us now consider a random variable X, which can take N possible values x_1, ..., x_N with probabilities p_1, ..., p_N; the expected amount of surprise we shall receive upon learning the value of X is given precisely by the entropy of the source emitting the random variable X, i.e. -\sum_i p_i \ln p_i.

The definition of entropy is closely related to a very old problem, that of transmitting a message without losing information, i.e. the problem of efficient encoding [25]. A good example is the Morse code. In the Morse code a text is encoded with two characters: line and dot. What is the best way to encode the characters of the English language (provided one can define a source for English) with sequences of dots and lines? The idea of Morse was to encode the more frequent characters with the minimum number of characters. Therefore the letter e, which is the most frequent English letter, is encoded with one dot (·), while the letter q is encoded with three lines and one dot (--·-).

The problem of the optimal coding for a text (or an image or any other kind of information) has been enormously studied.
In particular, Shannon [1] showed that there is a limit to the possibility to encode a given sequence: this limit is the entropy of the sequence.

This result is particularly important when the aim is the measure of the information content of a single finite sequence, without any reference to the source that emitted it. In this case the reference framework is Algorithmic Complexity theory, and the basic concept is the Chaitin-Kolmogorov entropy or Algorithmic Complexity (AC) [26, 27, 28, 29]: the entropy of a string of characters is the length (in bits) of the smallest program which produces the string as output and then stops. This definition is really abstract. In particular it is impossible, even in principle, to find such a program, and as a consequence the algorithmic complexity is a non-computable quantity. This impossibility is related to the halting problem and to Gödel's theorem [30].

It is important to recall that there exists a rather important relation between the Algorithmic Complexity K_N(W_N) of a sequence W_N of N characters and H_N:

\langle K_N / N \rangle = \frac{1}{N} \sum_{\{W_N\}} K_N(W_N) P(W_N) \xrightarrow{N \to \infty} \frac{h}{\ln 2}    (4)

where K_N is the binary length of the shortest program needed to specify the sequence W_N.

As a consequence, there exists a relation between the maximum compression rate of a sequence (σ_1 σ_2 ...), expressed in an alphabet with m symbols, and h. If the length N of the sequence is large enough, then it is not possible to compress it into another sequence (with an alphabet with m symbols) whose size is smaller than Nh/\ln m. Therefore, noting that the number of bits needed for a symbol in an alphabet with m symbols is \ln m, one has that the maximum allowed compression rate is h/\ln m [1].

Though the maximal theoretical limit of the Algorithmic Complexity is not achievable, there are nevertheless algorithms explicitly conceived to approach it. These are the file compressors, or zippers. A zipper takes a file and tries to transform it into the shortest possible file. Obviously this is not the best possible encoding of the file, but it represents a good approximation of it.

A great improvement in the field of data compression has been represented by the Lempel and Ziv algorithm LZ77 [4] (used for instance by gzip and zip). It is interesting to briefly recall how it works (see fig. 1).

FIG. 1: Scheme of the LZ77 algorithm. The LZ77 algorithm works sequentially and, at a generic step, looks in the look-ahead buffer for substrings already encountered in the buffer already scanned. These substrings are substituted by a pointer (d, n), where d is the distance of the previous occurrence of the same substring and n is its length. In the example, the original sequence qwhhABCDhhABCDzABCDhhz... becomes the zipped sequence qwhhABCDhh(6,4)z(11,6)z...; only strings longer than two characters are substituted.

Let x = x_1, ..., x_N be the sequence to be zipped, where x_i represents a generic character of the sequence's alphabet. The LZ77 algorithm finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, given by two numbers: a distance, representing how far back into the window the sequence starts, and a length, representing the number of characters for which the sequence is identical. More specifically, the algorithm proceeds sequentially along the sequence. Let us suppose that the first n characters have been codified. Then the zipper looks for the largest integer m such that the string x_{n+1}, ..., x_{n+m} already appeared in x_1, ..., x_n. It then codifies the string found with a two-number code composed by the distance between the two strings and the length m of the string found. If the zipper does not find any match, then it codifies the first character to be zipped, x_{n+1}, with its name. This eventuality happens for instance when codifying the first characters of the sequence, but this event becomes very infrequent as the zipping procedure goes on.
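The following is a minimal, deliberately naive sketch of such a sequential parse (our own illustration, not the authors' implementation: real zippers such as gzip index the window with hash chains and Huffman-code the output, and this greedy scan may segment the example slightly differently from fig. 1):

```python
def lz77_parse(x, window=32768, min_len=3):
    """Sequential LZ77-style parse: return a list of literal characters
    and (distance, length) pointers into the already-scanned buffer."""
    tokens, n = [], 0
    while n < len(x):
        best_len, best_dist = 0, 0
        # largest m such that x[n:n+m] already appeared in the scanned part
        for i in range(max(0, n - window), n):
            m = 0
            while n + m < len(x) and i + m < n and x[i + m] == x[n + m]:
                m += 1
            if m > best_len:
                best_len, best_dist = m, n - i
        if best_len >= min_len:          # as in fig. 1, short matches don't pay
            tokens.append((best_dist, best_len))
            n += best_len
        else:
            tokens.append(x[n])          # no useful match: emit the character
            n += 1
    return tokens

print(lz77_parse("qwhhABCDhhABCDzABCDhhz"))
```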
This zipper is asymptotically optimal: i.e. if it encodes a text of length L emitted by an ergodic source whose entropy per character is h, then the length of the zipped file divided by the length of the original file tends to h when the length of the text tends to ∞. The convergence to this limit is slow, and the corrections have been shown to behave as O(\log \log L / \log L) [31].

Usually, in commercial implementations of LZ77 (like for instance gzip), substitutions are made only if the two identical sequences are not separated by more than a certain number n_w of characters, and the zipper is said to have an n_w-long sliding window. The typical value of n_w is 32768. The main reason for this restriction is that the search in very large buffers could be inefficient from the point of view of computational time.

Just to give an example, if one compresses an English text, the length of the zipped file is typically of the order of one fourth of the length of the initial file. An English file is encoded with 1 byte (8 bits) per character. This means that after the compression the file is encoded with about 2 bits per character. Obviously this is not yet optimal: Shannon, with an ingenious experiment, showed that the entropy of English text is between 0.6 and 1.3 bits per character [32] (for a recent study see [19]).

It is well known that compression algorithms represent a powerful tool for the estimation of the AC or of more sophisticated measures of complexity [33, 34, 35, 36, 37], and several applications have been drawn in several fields [38]: from dynamical systems theory (the connections between Information Theory and Dynamical Systems theory are very strong and go back all the way to the works of Kolmogorov and Sinai [39, 40]; for a recent overview see [41, 42, 43]) to linguistics (an incomplete list would include [3, 6, 7, 8, 9, 44, 45, 46, 47, 48]), genetics (see [5, 10, 49, 50, 51, 52] and references therein) and music classification [53, 54].

A. Remoteness between two texts

It is interesting to recall the notion of relative entropy (or Kullback-Leibler divergence [55, 56, 57]), which is a measure of the statistical remoteness between two distributions and whose essence can be easily grasped with the following example.

Let us consider two stationary zero-memory sources A and B emitting sequences of 0 and 1: A emits a 0 with probability p and a 1 with probability 1-p, while B emits a 0 with probability q and a 1 with probability 1-q. As already described, a compression algorithm like LZ77 applied to a sequence emitted by A will asymptotically (i.e. in the limit of an available infinite sequence) be able to encode the sequence almost optimally, i.e. coding on average every character with -p \log_2 p - (1-p) \log_2 (1-p) bits (the Shannon entropy of the source). This optimal coding will not be the optimal one for the sequence emitted by B. In particular, the entropy per character of the sequence emitted by B in the coding optimal for A (i.e. the cross-entropy per character) will be -q \log_2 p - (1-q) \log_2 (1-p), while the entropy per character of the sequence emitted by B in its optimal coding is -q \log_2 q - (1-q) \log_2 (1-q). The number of bits per character wasted to encode the sequence emitted by B with the coding optimal for A is the relative entropy of A and B:

d(A||B) = -q \log_2 \frac{p}{q} - (1-q) \log_2 \frac{1-p}{1-q}    (5)

A linguistic example will help to clarify the situation: transmitting an Italian text with a Morse code optimized for English will result in the need of transmitting an extra number of bits with respect to another coding optimized for Italian: the difference is a measure of the relative entropy between, in this case, Italian and English (supposing the two texts are each archetypal representations of their language, which they are not).

We should remark that the relative entropy is not a distance (metric) in the mathematical sense: it is neither symmetric, nor does it satisfy the triangle inequality. As we shall see below, in many applications, such as phylogenesis, it is crucial to define a true metric that measures the actual distance between sequences.

There exist several ways to measure the relative entropy (see for instance [35, 36, 37]). One possibility is of course to follow the recipe described in the previous example: using the optimal coding for a given source to encode the messages of another source.

Here we follow the approach recently proposed in [3], which is similar to the approach by Ziv and Merhav [36]. In particular, in order to define the relative entropy between two sources A and B, we consider a sequence A from the source A and a sequence B from the source B. We now perform the following procedure: we create a new sequence A+B by appending B after A and use the LZ77 algorithm or, as we shall see below, a modified version of it.

In [11] it has been studied in detail what happens when a compression algorithm tries to optimize its features at the interface between two different sequences A and B while zipping the sequence A+B obtained by simply appending B after A. In particular, the existence has been shown of a scaling function ruling the way the compression algorithm learns a sequence B after having compressed a sequence A. It turns out that there exists a crossover length for the sequence B, given by

L_B^* ≃ L_A^α    (6)

with α = h(B) / (h(B) + d(B||A)). This is the length below which the compression algorithm does not learn the sequence B (measuring in this way the cross-entropy between A and B) and above which it learns B, i.e. optimizes the compression using the specific features of B.

This means that if B is short enough (shorter than the crossover length), one can measure the relative entropy by zipping the sequence A+B (using gzip or an equivalent sequential compression program); the measure of the length of B in the coding optimized for A will be ∆_AB = L_{A+B} - L_A, where L_X indicates the length in bits of the zipped file X. The cross-entropy per character between A and B will be estimated by

C(A|B) = ∆_AB / |B|,    (7)

where |B| is the length of the uncompressed file B. The relative entropy d(A||B) per character between A and B will be estimated by

d(A||B) = (∆_AB - ∆_{B'B}) / |B|,    (8)

where B' is a second sequence extracted from the source B, with |B'| characters, and ∆_{B'B}/|B| = (L_{B+B'} - L_B)/|B| is an estimate of the entropy of the source B.
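In practice, Eqs. (7) and (8) can be estimated with any sequential LZ77-based compressor. Here is a minimal sketch using Python's zlib (a DEFLATE/LZ77 compressor standing in for gzip; its 32 kB window plays the role of the n_w-long sliding window discussed above). It is only meaningful when B is shorter than the crossover length of Eq. (6), and we normalize per character of B:

```python
import zlib

def L(data: bytes) -> int:
    """Length in bits of the zipped file."""
    return 8 * len(zlib.compress(data, 9))

def C(A: bytes, B: bytes) -> float:
    """Cross-entropy estimate C(A|B) of Eq. (7): Delta_AB / |B|."""
    return (L(A + B) - L(A)) / len(B)

def d(A: bytes, B: bytes, B1: bytes) -> float:
    """Relative entropy estimate d(A||B) of Eq. (8).

    B1 is a second sequence from the source of B (assumed here to have
    roughly the same length as B, so the two Deltas are comparable)."""
    delta_AB = L(A + B) - L(A)      # bits to code B with A's statistics
    delta_B1B = L(B + B1) - L(B)    # estimate of the self-entropy of B
    return (delta_AB - delta_B1B) / len(B)
```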
If, on the other hand, B is longer than the crossover length, we must change our strategy and implement an algorithm which does not zip the B part but simply "reads" it with the (almost) optimal coding of part A. In this case we start reading sequentially the file B, and we search in the look-ahead buffer of B for the longest subsequence already occurred only in the A part. This means that we do not allow for searching matches inside B itself. As in the usual LZ77, every matching found is substituted with a pointer indicating where, in A, the matching subsequence appears, and its length. This method allows us to measure (or at least to estimate) the cross-entropy between B and A, i.e. C(A|B).

Before proceeding, let us briefly discuss the difficulties one could encounter in the practical implementation of the methods described in this section. First of all, in practical applications the sequences to be analyzed can be very long, and their direct comparison can then be problematic due to the finiteness of the window over which matchings can be found. Moreover, in some applications one is interested in estimating the self-entropy of a source, i.e. C(A|A), in a more coherent framework. The estimation of this quantity is necessary to calculate the relative entropy between two sources. In fact, as we shall see in the next section, even though in practical applications the simple cross-entropy is often used, there are cases in which the relative entropy is more suitable. The most typical case is when we need to build a symmetrical distance between two sequences. One could think of estimating the self-entropy by comparing, with the modified LZ77, two portions of a given sequence. This method is not very reliable, since many biases could afflict the results obtained in this way. For example, if we split a book in two parts and try to measure the cross-entropy between these two parts, the result we would obtain could be heavily affected by the names of the characters present in both parts. More importantly, defining the position of the cut would be completely arbitrary, and this arbitrariness would matter a lot, especially for very short sequences. We shall address this problem in section III.

B. On the definition of a distance

In this section we address the problem of defining a distance between two generic sequences A and B. A distance D is an application that must satisfy three requirements:

1. positivity: D_AB ≥ 0 (D_AB = 0 iff A = B);
2. symmetry: D_AB = D_BA;
3. triangular inequality: D_AB ≤ D_AC + D_CB for all C.

As is evident, the relative entropy d(A||B) does not satisfy the last two properties, while it is never negative. Nevertheless one can define a symmetric quantity as follows:

P_AB = P_BA = \frac{C(A|B) - C(B|B)}{C(B|B)} + \frac{C(B|A) - C(A|A)}{C(A|A)}    (9)

We now have a symmetric quantity, but P_AB does not satisfy, in general, the triangular inequality. In order to obtain a real mathematical distance, we give a prescription according to which this last property is met. For every pair A and B of sequences, the prescription writes as:

if P_AB > \min_C [P_AC + P_CB], then P_AB = \min_C [P_AC + P_CB].    (10)

By iterating this procedure until P_AB ≤ P_AC + P_CB for any A, B, C, we obtain a true distance D_AB. In particular, the distance obtained in this way is simply the minimum, over all the paths connecting A and B, of the total cost of the path (according to P_AB), i.e.

D_AB = \min_{N ≥ 2} \; \min_{X_1, ..., X_N : X_1 = A, X_N = B} \sum_{k=1}^{N-1} P_{X_k X_{k+1}}.    (11)

It is also easy to see that D_AB is the maximal distance not larger than P_AB for any A, B, where we have considered the partial ordering on the set of distances: P ≥ P' if P_AB ≥ P'_AB for all pairs A, B.

Obviously this is not an a priori distance: the distance between A and B depends, in principle, on the set of files we are considering.

In all our tests with linguistic texts, the triangle condition was always satisfied without the need to have recourse to the above-mentioned prescription. However, there are cases in other contexts, like, for instance, genetic sequences, in which it could be necessary to force the triangularization procedure described above.
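Operationally, iterating the prescription of Eq. (10) until the triangle inequality holds is an all-pairs shortest-path computation over the P-weighted graph, Eq. (11). A minimal sketch (our own; a single Floyd-Warshall sweep already reaches the fixed point):

```python
def triangularize(P):
    """Enforce Eq. (10) on a symmetric matrix P (list of lists) until the
    triangle inequality holds; the result is the path-minimum D of Eq. (11)."""
    n = len(P)
    D = [row[:] for row in P]
    for c in range(n):                # Floyd-Warshall relaxation
        for a in range(n):
            for b in range(n):
                if D[a][c] + D[c][b] < D[a][b]:
                    D[a][b] = D[a][c] + D[c][b]
    return D

# toy example: P violates the triangle inequality through the middle element
P = [[0.0, 1.0, 5.0],
     [1.0, 0.0, 1.0],
     [5.0, 1.0, 0.0]]
print(triangularize(P))   # D[0][2] becomes 2.0 <= D[0][1] + D[1][2]
```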
An alternative definition of distance can be given by considering

R_AB = \sqrt{P_AB},    (12)

where the square root must be taken before forcing the triangularization. The idea of using R_AB is suggested by the fact that, when A and B are very close sources, P_AB is of the order of the square of their "difference". Let us see this in a concrete example where the distance between the two sources is very small. Suppose we have two sources A and B which can emit sequences of 0 and 1. Let A emit a 0 with probability p and a 1 with the complementary probability 1-p. Now let the source B emit a 0 with probability p+ε and a 1 with probability 1-(p+ε), where ε is an infinitesimal quantity. In this situation it can be easily shown that the relative entropy between A and B is proportional to ε², and, of course, P_AB is then proportional to the same quantity. Taking the square root of P_AB then simply amounts to requiring that, if two sources have distributions of probability that differ by a small ε, their distance must be of the order of ε instead of being reduced to the ε² order.

It is important to recall that an earlier and rigorous definition of an unnormalized distance between two generic strings of characters has been proposed in [58], in terms of the Kolmogorov Complexity and of the Conditional Kolmogorov Complexity [30] (see below for the definition). A normalized version of this distance has been proposed in [52, 59]. In particular, Li et al. define

d_K(x, y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}    (13)

where the subscript K refers to its definition in terms of the Kolmogorov complexity. K(x|y) is the Conditional Kolmogorov Complexity, defined as the length of the shortest program to compute x if y is furnished as an auxiliary input to the computation, and K(x) and K(y) are the Kolmogorov complexities of the strings x and y, respectively. The distance d_K(x, y) is symmetrical, and it is shown to satisfy the identity axiom up to a precision d_K(x, x) = O(1/K(x)), and the triangular inequality d_K(x, y) ≤ d_K(y, z) + d_K(z, y) up to an additive term O(1/\max(K(x), K(y), K(z))).

The problem with this distance is the fact that it is defined in terms of the Conditional Kolmogorov Complexity, which is an uncomputable quantity, so that its computation is performed in an approximate way. In particular, what is important is whether the specific procedure (algorithm) used to approximate this quantity, which is indeed a well-defined mathematical operation, defines a true distance. In the specific case of the distance d_K(x, y) defined in [52], the authors approximate this distance by the so-called Normalized Compression Distance

NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}    (14)

where C(xy) is the compressed size of the concatenation of x and y, and C(x) and C(y) denote the compressed sizes of x and y, respectively. These quantities are then approximated in a suitable way by using real-world compressors.
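With a real-world compressor in place of K, Eq. (14) is a one-liner; the sketch below (ours, not the code of [52]) uses zlib, with the caveats about window length discussed next:

```python
import zlib

def Csize(data: bytes) -> int:
    """Compressed size: the real-world stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def NCD(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance of Eq. (14)."""
    return (Csize(x + y) - min(Csize(x), Csize(y))) / max(Csize(x), Csize(y))
```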
It is important to remark that there exists a discrepancy between the definition (13) and its actual approximate computation (14). We discuss here in some detail the case of the LZ77 compressor. Using the results presented in Sect. II A, one obtains that, if the length of y is small enough (see expression (6)), NCD(x, y) is actually estimating the cross-entropy between x and y. The cross-entropy is not a distance, since it does not satisfy the identity axiom, it is not symmetrical, nor does it satisfy the triangular inequality. In the general case of y being not small, again following the discussion of Sect. II A (presented in more detail in [11]), one can show that NCD(x, y) is given roughly (for L_x large enough) by:

1 + \frac{L_x^α}{L_y} \frac{d(x||y)}{C(y)},    (15)

where L_x and L_y are the lengths of the x and y files (with L_y >> L_x^α) and d(x||y) is the relative entropy rate between x and y. Again, this estimate does not define a metric. Moreover, since α ≤ 1, one can see that NCD(x, y) → 1, independently of the choice of x and y, when L_x and L_y tend to infinity.

The discrepancy between the definition of a mathematical distance based on the Conditional Kolmogorov Complexity and its actual approximate computation in [52] has also been pointed out in [60].

Finally, it is important to notice that recently Otu and Sayood [61] have proposed an alternative definition of distance between two strings of characters which is rigorous and computable. Their approach is based on the LZ complexity [62] of a sequence S, which can be defined in terms of the number of steps required by a suitable production process to generate S. In their very interesting paper they also give a review of this and correlated problems. We do not enter here into the details, and we refer the reader to [61].

III. DICTIONARIES AND ARTIFICIAL TEXTS

As we have seen, LZ77 substitutes sequences of characters with a pointer to their previous appearance in the text. We now need some definitions before proceeding. We call dictionary of a sequence the whole set of subsequences substituted with a pointer by LZ77, and we refer to these subsequences as the dictionary's words. As is evident from these definitions, a particular word can be present many times in the dictionary. Finally, we call root of a dictionary the sequence it has been extracted from. It is important to stress that this dictionary has in principle nothing to do with the ordinary dictionary of a given language. On the other hand, there could be important similarities between the LZ77-dictionary of a written text and the dictionary of the language in which the text is written.
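A dictionary in this sense can be collected directly during an LZ77-style parse, by recording every substring that gets replaced with a pointer and keeping multiplicities. A sketch (ours; quadratic in time and therefore only for short roots — production zippers index the window with hash tables; the file name is hypothetical):

```python
from collections import Counter

def lz77_dictionary(x, window=32768, min_len=3):
    """The dictionary of root x: the multiset of substrings that an
    LZ77-style sequential parse replaces with pointers."""
    words, n = Counter(), 0
    while n < len(x):
        best_len, best_i = 0, -1
        for i in range(max(0, n - window), n):
            m = 0
            while n + m < len(x) and i + m < n and x[i + m] == x[n + m]:
                m += 1
            if m > best_len:
                best_len, best_i = m, i
        if best_len >= min_len:
            words[x[n:n + best_len]] += 1   # a repeated substring: a "word"
            n += best_len
        else:
            n += 1
    return words

# dictionary = lz77_dictionary(open("mobydick.txt").read())  # hypothetical path
# dictionary.most_common(7)   # cf. Table I below
```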
As an example, we report in Table I and Table II the most frequent and the longest words found by LZ77 while zipping Melville's Moby Dick text. Figure 2 reports an example of the frequency-length distribution of the LZ77-words as a function of their length (for a very similar figure and a similar but less complete dictionary analysis see [10]).

TABLE I: Most frequent LZ77-words found in Moby Dick's text. Here we present the most represented words in the dictionary of Moby Dick. The dictionary was extracted using a 32768-character sliding window in LZ77. The symbol ⌣ represents the space character.

Frequency  Length  Word
110        6       .⌣The⌣
107        7       in⌣the⌣
98         4       you⌣
94         6       .⌣But⌣
92         9       from⌣the⌣
92         5       ⌣very⌣
91         4       one⌣

TABLE II: Longest words in Moby Dick. Here we present the longest words in the dictionary of Moby Dick. Each of these words appears only one time in the dictionary. The dictionary was extracted using a 32768-character sliding window in LZ77.

Frequency  Length  Word
1          80      ,–⌣Such⌣a⌣funny,⌣sporty,⌣gamy,⌣jesty,⌣joky,⌣hoky-poky⌣lad,⌣is⌣the⌣Ocean,⌣oh!⌣Th
1          78      ,–⌣Such⌣a⌣funny,⌣sporty,⌣gamy,⌣jesty,⌣joky,⌣hoky-poky⌣lad,⌣is⌣the⌣Ocean,⌣oh!⌣
1          63      ”⌣”I⌣look,⌣you⌣look,⌣he⌣looks;⌣we⌣look,⌣ye⌣look,⌣they⌣look.”⌣”W
1          63      ”!⌣”I⌣look,⌣you⌣look,⌣he⌣looks;⌣we⌣look,⌣ye⌣look,⌣they⌣look.”⌣”
1          54      repeated⌣in⌣this⌣book,⌣that⌣the⌣skeleton⌣of⌣the⌣whale
1          46      .⌣THIS⌣TABLET⌣Is⌣erected⌣to⌣his⌣Memory⌣BY⌣HIS⌣
1          43      s⌣a⌣mild,⌣mild⌣wind,⌣and⌣a⌣mild⌣looking⌣sky

FIG. 2: LZ77-word distribution. This figure illustrates the distribution of the LZ77-words found in different strings of characters (log-log plots of the number of words versus word length). Above: results for the dictionary of Moby Dick; in the upper curve several findings of the same word are considered separately, while in the lower curve each different word is counted only once. It can be shown that the peaks are well fitted by a log-normal distribution, while there are large deviations from it for large lengths. Below: words extracted from Mesorhizobium loti bacterium's original and reshuffled DNA sequences are analyzed. The log-normal curve fits well the whole distribution of words extracted from the reshuffled string, but is unable to describe the presence of the long words of the true one.

Beyond their utility for zipping purposes, the dictionaries present an intrinsic interest, since one can consider them as a source for the principal and most important syntactic structures present in the sequence/text from which the dictionary originates.

A straightforward application is the possibility to construct Artificial Texts. With this name we mean sequences of characters built by concatenating words randomly extracted from a specific dictionary. Each word has a probability of being extracted proportional to the number of its occurrences in the dictionary. Since LZ77 words typically already contain spaces, we do not include further spaces separating them. It should be stressed that the structure of a dictionary is affected by the size of the LZ77 sliding window. In our case we have typically adopted windows of 32768 characters and, in a few cases, of 65536 characters.

Below we present an excerpt of 400 characters taken from an artificial text (AT) having Melville's Moby Dick text as root:

upon that can onge Sirare ce more le in and for contrding to the nt him hat seemed ore, es; vacaknowt.” ” it seemside delirirous from the gan . All ththe boats bedagain, brightflesh,yourselfhe blacksmith’sleg t. Mre?loft restoon those boats round with at coneedallioundantic turneeling he had Queequeg, man .”Tisheed the o corevolving se were by their fAhab tcandle aed. Cthat the ive ing, head had

As is evident, the meaning is completely lost, and the only feature of this text is to represent in a statistically significant way the typical structures found in the original root text (i.e. the typical subsequences of characters).

The case of sequences representing texts is interesting, and it is worth spending a few words about it, since a clear definition of word already exists in every language. In this case one could also define natural artificial texts (NAT).
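Given such a dictionary, building an AT is straightforward: draw words with probability proportional to their multiplicity and concatenate them, without extra separators. A sketch (ours), reusing the lz77_dictionary function above:

```python
import random

def artificial_text(dictionary, length=30000, seed=None):
    """Concatenate words drawn from the dictionary with probability
    proportional to their multiplicity; no extra spaces are added,
    since LZ77 words typically carry their own."""
    rng = random.Random(seed)
    words = list(dictionary.keys())
    weights = list(dictionary.values())
    pieces, size = [], 0
    while size < length:
        w = rng.choices(words, weights=weights)[0]
        pieces.append(w)
        size += len(w)
    return "".join(pieces)[:length]

# at = artificial_text(lz77_dictionary(root_text))  # root_text: any string
```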
A NAT is obtained by concatenating true words as extracted from a specific text written in a certain language. Also in this case, each word is chosen according to a probability proportional to the frequency of its occurrence in the text. Just for comparison with the previous AT, we report an example of a natural artificial text built using real English words taken randomly with a probability proportional to their frequency of occurrence in Moby Dick's text:

of Though sold, moody Bedford opened white last on night; FRENCH unnecessary the charitable utterly form submerged blood firm-seated barricade, and one likely keenly end, sort was the to all what ship nine astern; Mr. and Rather by those of downward dumb minute and are essential were baby the balancing right thereupon flag were months, equatorial whale’s Greenland great spouted know Delight,

FIG. 3: The Artificial Text Comparison (ATC) method. Instead of comparing two original strings, several AT (two in the figure) are created starting from the dictionaries extracted from the original strings, and the comparison is between pairs of AT: 1) dictionary extraction (Text A → Dict A, Text B → Dict B, via LZ77); 2) creation of artificial texts (Dict A → ArtText A1, ArtText A2; Dict B → ArtText B1, ArtText B2); 3) cross-entropy estimation for the artificial texts (ArtText Ai vs ArtText Bj → C(i|j)); 4) averaging: C(A|B) = <C> ± σ. For each pair of AT coming from different roots a cross-entropy value C(i|j) is measured, and the cross-entropy between the root strings is obtained as the average <C> of all the C(i|j). This method has the advantage of allowing for the estimation of an error σ on the obtained value of the cross-entropy <C>, as the standard deviation of the C(i|j). From the point of view of the computational demand of ATC, point 1) simply consists in the procedure of zipping the original files, which usually requires a few seconds, and points 2) and 4) are of course negligible, while point 3) is crucial: the machine time requested for the cross-entropy estimation grows as the square of the number of AT created (for fixed length of the AT).

We now describe how Artificial Texts can be effectively used for recognition and classification purposes. First of all, AT present several positive features. They allow one to define typical words for generic sequences (not only for texts). Moreover, for each original text (or original sequence), one can construct an ensemble of AT. This opens the way to the possibility of performing statistical analysis by comparing the features of many AT, all representative of the same original root text. In this way it is possible to overcome all the difficulties, discussed in the previous section, related to the length of the strings analyzed. In fact it seems very plausible that, once a certain "reasonable" AT size has been established, any string can be well represented by a number of AT proportional to its length. On the other hand, one can construct AT by merging dictionaries coming from different original texts: for example, merging dictionaries extracted from different texts all about the same subject, or all written by the same author. In this way the AT would play the role of an archetypal text of that specific subject or that specific author [63].
The possibility to construct many different AT, all representative of the same original sequence (or of a given source), allows for an alternative way to estimate the self-entropy of a source (and consequently the relative entropy between two sources, as mentioned above). The cross-entropy between two AT corresponding to the same source will in fact directly give an estimate of the self-entropy of the source. This is an important point, since in this way it is possible to estimate the relative entropy and the distances between two texts of the form proposed in eq. (9) in a coherent framework. Finally, as shown in Figure 3, by comparing many AT coming from the same two roots (or from a single root), we can estimate a statistical error on the value of the cross-entropy between the two roots.

With the help of AT we can then build a comparison scheme (Artificial Text Comparison, or ATC; see figure 3) between sequences, whose validity will be checked in the following sections. This scheme is very general, since it can be applied to any kind of sequence independently of the coding behind it. Moreover, the generality of the scheme comes from the fact that, by means of a re-definition of the concept of word, we are able to extract subsequences from a generic sequence using a deterministic algorithm (for instance LZ77), which eliminates every arbitrariness (at least once the algorithm for the dictionary extraction has been chosen). In the following sections we shall discuss in detail how one can use AT for recognition and classification purposes.
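Putting the pieces together, the whole ATC scheme of fig. 3 reduces to a few lines on top of the sketches introduced above (lz77_dictionary, artificial_text and the cross-entropy estimator C of section II); as in the figure, the estimate comes with an error bar:

```python
import statistics

def atc(root_A: str, root_B: str, n_at=2, at_len=30000):
    """Artificial Text Comparison: return (<C>, sigma) over all pairs of
    AT built from the dictionaries of the two roots (steps 1-4 of fig. 3)."""
    dict_A, dict_B = lz77_dictionary(root_A), lz77_dictionary(root_B)
    ats_A = [artificial_text(dict_A, at_len, seed=k) for k in range(n_at)]
    ats_B = [artificial_text(dict_B, at_len, seed=k) for k in range(n_at)]
    values = [C(a.encode(), b.encode()) for a in ats_A for b in ats_B]
    return statistics.mean(values), statistics.stdev(values)
```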
IV. RECOGNITION OF LINGUISTIC FEATURES

Our first experiments are concerned with the recognition of linguistic features. Here we consider those situations in which we have a corpus of known texts and one unknown text X. We are interested in identifying the known text A closest (according to some rule) to the X one. We then say that X, being similar to A, belongs to the same group as A. This group can, for instance, be formed by all the works of an author, and in that case we say that our method attributed X to that author. We now present results obtained in experiments of language recognition and authorship attribution. After having explained our experiments, we will be able to make some more comments on the criterion we adopted to set recognition and/or attribution.

A. Language recognition

Suppose we are interested in the automatic recognition of the language in which a given text X is written. This case can be seen as a first benchmark for our recognition technique. The procedure we use considers a collection (a corpus), as large as possible, of texts in different (known) languages: English, French, Italian, Tagalog... We take an X text to play the role of the unknown text whose language has to be recognized, and the remaining A_i texts of our collection to form our background. We then measure the cross-entropy between our X text and every A_i with the procedure discussed in section II. The text among the A_i group with the smallest cross-entropy with the X one selects the language closest to that of the X file, or exactly its language, if the collection of languages contains it. In our experiment we have considered in particular a corpus of texts in 10 official languages of the European Union (EU) [64]: Danish, Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish and Swedish. Using 10 texts for each language, we had a collection of 100 texts. We have obtained that for every single text the method has recognized the language: the text A_i for which the cross-entropy with the unknown X text was the smallest was always a text written in the same language. We also found that, if we ranked for each X text all the texts A_i as a function of the cross-entropy, all the texts written in the same language as the unknown text were in the first positions. This means that the recall, defined in the framework of information retrieval as the ratio between the number of relevant documents retrieved (independently of their position in the ranking) and the total number of existing relevant documents, is maximal, i.e. equal to one. The recognition of language works quite well for lengths of the X file as small as a few tens of characters.
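In code, the whole recognition procedure is a ranking by cross-entropy. A sketch (ours, on top of the estimator C of section II; the corpus variables are hypothetical):

```python
def rank_by_cross_entropy(X: str, background: dict) -> list:
    """Rank the labels of the background texts A_i by C(A_i|X); the first
    entry is the recognized language (or, in section IV B, the author)."""
    scores = {label: C(A.encode(), X.encode())
              for label, A in background.items()}
    return sorted(scores, key=scores.get)

# ranking = rank_by_cross_entropy(unknown_text, {"italian": it_text,
#                                                "english": en_text})
```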
B. Authorship attribution

Suppose now we are interested in the automatic recognition of the author of a given text X. We shall consider, as before, a collection, as large as possible, of texts of several (known) authors, all written in the same language as the unknown text, and we shall look for the text A_i for which the cross-entropy with the X text is minimum. In order to collect a certain statistics, we have performed the experiment using a corpus of 87 different texts [65] by 11 Italian authors, using in each run one of the texts in the corpus as the unknown X text. In a first step we proceeded exactly as for language recognition, using the actual texts. The results, shown in Table III, feature a rate of success of roughly 93%. This rate is the ratio between the number of texts whose author has been recognized (another text by the same author was ranked first) and the total number of texts considered. There are of course fluctuations in the success rate for each author, and this has to be expected, since the writing style is something difficult to grasp and define; moreover, it can vary a lot across the production of a single author.

TABLE III: Author recognition. This table illustrates the results of the author recognition experiments. For each author we report the number of different texts considered and a measure of success for each of the three methods adopted. Labeled as successes are the numbers of times another text by the same author was ranked in the first position using the minimum cross-entropy criterion.

AUTHOR        Number of texts  Successes: Actual texts  Successes: ATC  Successes: NATC
Alighieri     5                5                        5               5
D'Annunzio    4                4                        4               4
Deledda       15               15                       15              15
Fogazzaro     5                4                        5               5
Guicciardini  6                5                        6               6
Machiavelli   12               12                       11              10
Manzoni       4                3                        4               4
Pirandello    11               11                       11              11
Salgari       11               10                       11              11
Svevo         5                5                        5               5
Verga         9                7                        9               9
TOTALS        87               81                       86              85

We then proceeded to analyze the same corpus with the ATC method discussed in the previous section. We extracted the dictionary from each text and built up our 87 artificial texts (each one 30000 characters long). In each run of our experiment we chose one artificial text to play the role of the text whose author was unknown, and the other 86 to be our background. The result is significant. We found that 86 times out of 87 trials the author was indeed recognized, i.e. the cross-entropy between our unknown text and at least another text by the right author was the smallest. This means that the rate of success using artificial texts was 98.8%. The unrecognized text was L'Asino by Machiavelli, which was attributed to Dante (La Divina Commedia); in fact, these are both poetic texts, so it does not appear so strange that L'Asino is found to be in some way closer to the Commedia rather than to Il Principe.

A slightly different way to proceed is the following. Instead of extracting an artificial text from each actual text, we made a single artificial text, which we call the author archetype, for each author. To do this, we simply joined all the dictionaries of the author and then proceeded as before. In this case we used actual works as unknown texts and author archetypes as background. We obtained that 86 out of 87 unknown real texts matched the right artificial author text, the one missing being again L'Asino.

In order to investigate this mismatching further, we exploited one of the biggest advantages the ATC method can give if compared with real text comparison. While in real text comparison only one trial can be made, ATC allows for the creation of an ensemble of different artificial texts, so that more than one trial is possible. In our specific case, however, 10 different ATC trials, performed both with artificial texts and with author archetypes, gave the same result, attributing L'Asino to Dante. This probably confirms our supposition that the pattern of the poetic register is very strong in this case. To be sure that our 98.8% rate of success was not due to a particularly fortuitous accident in our set of artificial texts, we repeated our experiment with a corpus formed by 5 artificial texts for each actual text. This means that our collection was formed by 435 texts. We then proceeded in the usual way. Having our cross-entropies between the 5 X_n (n = 1, ..., 5) artificial texts coming from the same root X and the remaining 430 AT, we first joined all the rankings relative to these X_n. Thus we had 430 x 5 cross-entropies between the AT extracted from the same root X and the other AT of our ensemble. We then averaged, for each root A_i, all the 25 cross-entropies between an AT created from the X text and an AT extracted from that A_i. In this way we obtained 86 cross-entropy values, and we set the authorship attribution using the usual minimum criterion. We found again that 86 texts out of 87 were well attributed, L'Asino being again mis-attributed.

This result shows that ATC is a robust method, since it does not seem to be strongly influenced by the particular set of artificial texts. In particular, as we have discussed before, ATC allows for a quantification of the error committed in the cross-entropy estimation. Defining as σ_m the standard deviation estimated for the m-th cross-entropy, in a ranking in which the smallest cross-entropy value is the first one, we empirically observed these relations:

σ_1 / C_1 ≃ σ_2 / C_2 ≃ σ_3 / C_3 ≃ 0.5%    (16)

(C_2 - C_1) ≃ σ_1 ≃ σ_2.    (17)

The difference C_2 - C_1 gives an indication of the level of confidence of the results. When this difference is of the order of the standard deviation of C_1 and C_2, this is an indication that the result of the attribution has a high level of confidence (at least inside the corpus of reference files/texts considered).
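The ensemble nature of ATC makes this criterion mechanical to apply. In a sketch (ours; one reading of the criterion, with ensemble_scores mapping each candidate root to the list of its C(i|j) values, at least two per candidate):

```python
import statistics

def attribute_with_confidence(ensemble_scores: dict):
    """Minimum-<C> attribution with the empirical check of Eqs. (16)-(17):
    the attribution is flagged as confident when C2 - C1 is at least of
    the order of the standard deviations sigma1, sigma2."""
    ranking = sorted((statistics.mean(v), statistics.stdev(v), label)
                     for label, v in ensemble_scores.items())
    (C1, s1, best), (C2, s2, _) = ranking[0], ranking[1]
    return best, (C2 - C1) >= min(s1, s2)
```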
Finally, in order to explore the possibility of using natural words, we performed experiments with natural artificial texts. We call this method Natural ATC, or NATC. We built up 5 artificial texts for each actual one, using Italian words instead of words extracted by LZ77. Having these natural artificial texts, we proceeded exactly as before. We obtained that 85 out of 87 texts were recognized. Besides L'Asino, the other mismatch was the Istorie Fiorentine by Machiavelli, which was set closest to Storie Fiorentine dal 1378 al 1509 by Guicciardini. It seems clear that the closeness of the subjects treated in the two texts played a fundamental role in the attribution.

It is interesting to attempt some conjectures on why the artificial texts made up from the LZ77-extracted dictionary worked better in our experiment. Probably the main reason is that LZ77 very often puts some correlation between characters and actual words by grouping them into a single word, while clearly this correlation does not exist when using natural words. In a text written to be read, words and/or characters are correlated in a precise way, especially in some cases (one of the strictest, but probably less significant, is "." followed by a capital letter). These observations could suggest that LZ77 is able to capture correlations that are in some sense a signature of an author, this signature being stronger (up to a certain point, of course) than that of the subject of a particular text. On the other hand, this ability to keep memory of correlations, combined with the specificity of the poetic register, could also explain the apparent strength of the poetic pattern that seems to emerge from our experiments.
