Hyperbolic statistics and entropy loss in the description of gene distributions

K. Lukierska-Walasek, Faculty of Mathematics and Natural Sciences, Cardinal Stefan Wyszyński University, Dewajtis 5, 01-815 Warsaw, Poland

K. Topolski, Institute of Mathematics, Wroclaw University, Pl. Grunwaldzki 2/4, 50-384 Wroclaw, Poland

K. Trojanowski, Institute of Computer Science, Polish Academy of Science, ul. Jana Kazmierza 5, 01-248 Warszawa, Poland

Zipf's law implies statistical distributions of hyperbolic type, which can describe the properties of stability and entropy loss in linguistics. We present the information theory from which it follows that a system described by distributions of hyperbolic type admits the possibility of entropy loss. We present the number of repetitions of genes in the genomes of some bacteria, such as Borrelia burgdorferi and Escherichia coli. The distributions of gene repetitions in a genome appear to be represented by distributions of hyperbolic type.

Keywords: gene length, hyperbolic distributions, entropy

1. Introduction. In recent years it has become possible to obtain genome sequence data for many organisms. Genomes have been studied intensively by a number of different methods [1]-[10]. The statistical analysis of DNA is complicated because of its complex structure: it consists of coding and non-coding regions, repetitive elements, etc., which have an influence on the local sequence composition. The long-range correlations in the sequence composition of DNA are much more complex than a simple power law; moreover, the effective exponents of the scaling regions vary considerably between different species [9], [10]. In papers [5], [6] the Zipf approach to analyzing linguistic texts has been extended to the statistical study of DNA base pair sequences.

In our paper we take into account some linguistic features of genomes to study the statistics of gene lengths. We present the information theory from which it follows that a system described by a special distribution of hyperbolic type admits the possibility of entropy loss. Distributions of hyperbolic type also describe the properties of stability and flexibility, which explain how the language considered in paper [11] can develop without changing its basic structure. A similar situation can occur in genome sequence data, which carry the genetic information. We can expect the above features in gene strings, as in language strings (words, sentences), because of the presence of redundancy in both cases.

In Sect. 2 we present some common features of linguistics and genetics. In Sect. 3 we describe the Code Length Game, the Maximum Entropy Principle, and the theorem on distributions of hyperbolic type and entropy loss; the gene distributions studied later appear to be hyperbolic in the sense of this theorem. The final Sect. 4 contains applications to genetics and final remarks.

2. Some common features of linguistics and genetics. A language is characterized by an alphabet with letters a, b, c, ..., which form words as sequences of n letters. In quantum physics the analogy to letters can be attached to pure states, while texts correspond to general mixed states. Languages have very different alphabets: computers use 0, 1 (two bits), the English language has 27 letters including the space, and DNA has four nitric bases: G (guanine), A (adenine), C (cytosine), T (thymine). A collection of letters can be ordered or disordered. To quantify the disorder of different collections of letters we use the entropy

    H = -\sum_i p_i \log_2 p_i,   (1)

where p_i denotes the probability of occurrence of the i-th letter. Taking logarithms to base 2 leads to entropy measured in bits. When all letters have the same probability, the entropy obviously attains its maximum value H_max.

The real entropy has a lower value H_eff, because in real languages the letters do not all have the same probability of appearance. The redundancy R of a language is defined [11] as

    R = \frac{H_{max} - H_{eff}}{H_{max}}.   (2)

The quantity H_max - H_eff is called the information. The information depends on the difference between the maximum entropy and the actual entropy: the larger the actual entropy, the smaller the redundancy. Redundancy can be measured by the frequencies with which different letters occur in one or more texts. The redundancy R is the number such that if we remove the fraction R of the letters determined by the redundancy, the meaning of the text can still be understood. In English some letters occur more frequently than others, and similarly in the DNA of vertebrates the C and G bases are usually less frequent than the A and T bases. A low value of redundancy makes it easier to fight transcription errors in the genetic code. In papers [5], [6] it is demonstrated that non-coding regions of eukaryotes display a smaller entropy and a larger redundancy than coding regions.
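As a minimal numerical illustration of Eqs. (1) and (2), the following Python sketch computes H_eff and the redundancy R for a four-letter DNA alphabet. The base frequencies are invented for illustration only (they are not measured genome data), and the function names are ours; any distribution summing to 1 would do.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum_i p_i log2 p_i of Eq. (1), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy(probs):
    """Redundancy R = (H_max - H_eff) / H_max of Eq. (2); H_max = log2 of the
    alphabet size is attained when all letters are equally probable."""
    h_max = math.log2(len(probs))
    return (h_max - entropy_bits(probs)) / h_max

# Hypothetical vertebrate-like base frequencies (A, T, C, G): C and G rarer.
freqs = [0.30, 0.30, 0.20, 0.20]
print(entropy_bits(freqs))   # H_eff ~ 1.971 bits, below H_max = 2 bits
print(redundancy(freqs))     # R ~ 0.015
```

Even this mild bias toward A and T already produces a small positive redundancy; a uniform distribution would give R = 0 exactly.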
3. Maximal Entropy Principle and Code Length Game. In this section we provide some mathematics from information theory which is helpful in the quantitative formulation of our approach; for details see [11].

Let A be an alphabet, i.e. a discrete set, finite or countably infinite. Let M^1_+(A) and \widetilde{M}^1_+(A) be, respectively, the set of probability measures on A and the set of non-negative measures P such that P(A) \leq 1. The elements of A can be thought of as letters. By K(A) we denote the set of mappings, the compact codes, k : A \to [0, \infty], which satisfy Kraft's equality [13]

    \sum_{i \in A} \exp(-k_i) = 1.   (3)

By \widetilde{K}(A) we denote the set of all mappings, the general codes, k : A \to [0, \infty], which satisfy Kraft's inequality [13]

    \sum_{i \in A} \exp(-k_i) \leq 1.   (4)

For k \in \widetilde{K}(A) and i \in A, k_i is the code length, for example the length of a word. For k \in \widetilde{K}(A) and P \in M^1_+(A) the average code length is defined as

    \langle k, P \rangle = \sum_{i \in A} k_i p_i.   (5)

There is a bijective correspondence between p_i and k_i: k_i = -\ln p_i and p_i = \exp(-k_i).
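The following short Python sketch (our illustration, with an arbitrary dyadic distribution; the function names are ours) checks numerically that the code adapted to a distribution via k_i = -ln p_i satisfies Kraft's equality (3), and that its average code length (5) equals the entropy of P in natural units, anticipating the minimal-code-length representation of the entropy given below.

```python
import math

def kraft_sum(code_lengths):
    """Left-hand side of Kraft's (in)equality, Eqs. (3)-(4): sum_i exp(-k_i)."""
    return sum(math.exp(-k) for k in code_lengths)

def adapted_code(probs):
    """The code adapted to P via the bijection k_i = -ln p_i."""
    return [-math.log(p) for p in probs]

def average_code_length(code_lengths, probs):
    """<k, P> = sum_i k_i p_i of Eq. (5)."""
    return sum(k * p for k, p in zip(code_lengths, probs))

probs = [0.5, 0.25, 0.125, 0.125]     # arbitrary example distribution
k = adapted_code(probs)
print(kraft_sum(k))                   # 1.0: Kraft's equality (3) holds
print(average_code_length(k, probs))  # ~1.213 nats: equals H(P) of Eq. (6)
```

Any other code satisfying (4) would give an average length at least as large on this distribution, which is the game-theoretic content of the next formulas.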
For P \in M^1_+(A) we also introduce the entropy

    H(P) = -\sum_{i \in A} p_i \ln p_i.   (6)

The entropy can be represented as the minimal average code length (see [11]):

    H(P) = \min_{k \in \widetilde{K}(A)} \langle k, P \rangle.

Let \mathcal{P} \subseteq M^1_+(A); then

    H_{max}(\mathcal{P}) = \sup_{P \in \mathcal{P}} \inf_{k \in \widetilde{K}(A)} \langle k, P \rangle
                         = \sup_{P \in \mathcal{P}} H(P)
                         \leq \inf_{k \in \widetilde{K}(A)} \sup_{P \in \mathcal{P}} \langle k, P \rangle
                         = R_{min}(\mathcal{P}).   (7)

Formula (7) presents the Nash optimal strategies: R_min denotes the minimum risk, k the Nash equilibrium code, and P the probability distribution. For example, in linguistics the listener is a minimizer and the speaker is a maximizer. We have words with distributions p_i and their codes k_i, i = 1, 2, ...; the listener chooses the codes k_i, the speaker chooses the probability distributions p_i.

We can note that Zipf argued [14] that in the development of a language a vocabulary balance is reached as a result of two opposing forces: unification, which tends to reduce the vocabulary and corresponds to a principle of least effort seen from the point of view of the speaker, and diversification, connected with the listener's wish to know the meaning of the speech.

This principle is as basic as the Maximum Entropy Principle, in the sense that the search for one type of optimal strategy in the Code Length Game translates directly into a search for distributions with maximum entropy. Given a code k \in \widetilde{K}(A) and a distribution P \in M^1_+(A), the optimal strategy H(P) = \inf_{k \in \widetilde{K}(A)} \langle k, P \rangle is represented by the entropy H(P), while the actual strategy is represented by \langle k, P \rangle.

Zipf's law is an empirical observation which relates the rank and the frequency of words in natural language [14]. This law suggests modeling by distributions of hyperbolic type [11], because no distribution over N can have probabilities proportional to 1/i, due to the lack of normalization.

We consider a class of distributions P = (p_1, p_2, ...) over N with p_1 \geq p_2 \geq .... P is said to be hyperbolic if, for any given \alpha > 1, p_i \geq i^{-\alpha} for infinitely many indexes i. As an example we can choose p_i \sim i^{-1}(\log i)^{-c} for some constant c > 2.

The Code Length Game for a model \mathcal{P} \subseteq M^1_+(N), with codes k : A \to [0, \infty] for which \sum_{i \in A} \exp(-k_i) = 1, is in equilibrium if and only if H_{max}(co(\mathcal{P})) = H_{max}(\mathcal{P}), where co(\mathcal{P}) denotes the convex hull of \mathcal{P}. In such a case a distribution P* is the H_max-attractor, i.e. P_n \to P* for every sequence (P_n)_{n \geq 1} \subseteq \mathcal{P} for which H(P_n) \to H_{max}(\mathcal{P}). One expects that H(P*) = H_{max}(\mathcal{P}), but in the case of entropy loss we have H(P*) < H_{max}(\mathcal{P}). Such a possibility appears when the distributions are hyperbolic, as follows from the theorem in [11].

Theorem. Assume that P* \in M^1_+(N) is of finite entropy and has ordered point probabilities. Then a necessary and sufficient condition for P* to occur as the H_max-attractor in a model with entropy loss is that P* is hyperbolic. If this condition is fulfilled, then for every h with H(P*) < h < \infty there exists a model \mathcal{P}_h with P* as H_max-attractor and H_{max}(\mathcal{P}_h) = h. In fact, \mathcal{P}_h = \{P \mid \langle k^*, P \rangle \leq h\} is the largest such model, where k^* denotes the code adapted to P*, i.e. k^*_i = -\ln p^*_i, i \geq 1.

As an example we can consider an "ideal" language in which the frequencies of words are described by a hyperbolic distribution P* with finite entropy. At a certain stage of life one is able to communicate at a reasonably high rate, about H(P*), and to improve the language by introducing specialized words, which occur seldom in the language as a whole. This process can be continued for the rest of one's life: one is able to express complicated ideas and develop the language without changing its basic structure. A similar situation can be expected in genomes.
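To make the hyperbolic family concrete, the following Python sketch (our own numerical illustration, not taken from [11]) evaluates truncations of the example p_i \sim i^{-1}(\log i)^{-c} with the arbitrary choice c = 2.5. Both the normalizing mass and the entropy keep creeping upward as the truncation point N grows, showing how much of the distribution lives in the far tail; this slow convergence is the numerical face of the entropy-loss phenomenon discussed above.

```python
import math

C = 2.5  # any c > 2 gives a normalizable distribution with finite entropy

def truncated_mass_and_entropy(n_max):
    """Mass Z_N and entropy H (nats) of p_i ~ 1/(i log^c i) on 2 <= i < N."""
    w = [1.0 / (i * math.log(i) ** C) for i in range(2, n_max)]
    z = sum(w)
    h = -sum((x / z) * math.log(x / z) for x in w)
    return z, h

for n in (10**3, 10**4, 10**5, 10**6):
    z, h = truncated_mass_and_entropy(n)
    print(f"N = {n:>7}:  Z_N = {z:.4f}   H = {h:.4f} nats")
# Both columns are still growing at N = 10^6, in contrast to light-tailed
# distributions (e.g. geometric), whose truncations converge quickly.
```

Note that the defining condition p_i \geq i^{-\alpha} only has to hold for infinitely many, typically astronomically large, indexes i, so it is the slow convergence rather than the condition itself that is accessible numerically.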
4. Applications to genetics. The gene data which we use are obtained from GenBank (ftp://ftp.ncbi.nih.gov).

FIG. 1: Graph of the number of repetitions of genes in the genome of Escherichia coli; the genome consists of 5874 genes.

FIG. 2: Graph of the number of repetitions of genes in the genome of Escherichia coli; the genome consists of 4892 genes.

FIG. 3: Graph of the number of repetitions of genes in the genome of Borrelia burgdorferi; the genome consists of 1239 genes.

Figures 1, 2 and 3 show how often the same genes are repeated in a genome. For example, Figure 2 shows that the most frequently repeated gene has number 1 and appears 5 times, the second one also appears 5 times, the third gene 4 times, etc. The probabilities of appearance of the genes exhibit distributions of hyperbolic type [11].

5. Final remarks. Zipf's law relates the rank k and the frequency f of words in natural languages by the exact hyperbolic formula k \times f = const. However, the probability values 1/n for all natural numbers n do not satisfy the normalizability condition. This problem does not occur if we introduce distributions of hyperbolic type. In this paper we follow the statistical description of the structure of natural languages and present the information-theoretic considerations which yield, as conclusions, the stability property and the possibility of entropy loss. We apply similar reasoning to the description of genomes, with the genes corresponding to words in languages. We test three examples of genomes and describe the gene statistics dealing with the number of repetitions in the genome. It appears that the distribution of genes is represented by distributions of hyperbolic type.

One of the authors (K.L-W) would like to thank prof. Franco Ferrari for discussions.

[1] W. Li and K. Kaneko, Europhys. Lett. 17, 655 (1992).
[2] C. K. Peng et al., Nature (London) 356, 168 (1992).
[3] R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992).
[4] C. K. Peng et al., Phys. Rev. E 49, 1685 (1994).
[5] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. Lett. 73, 3169 (1994).
[6] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. E 52, 2939 (1995).
[7] H. E. Stanley et al., Physica A (Amsterdam) 273, 1 (1999).
[8] M. de Sousa Vieira, Phys. Rev. E 60, 5932 (1999).
[9] D. Holste et al., Phys. Rev. E 67, 061913 (2003).
[10] P. W. Messer, P. F. Arndt, and M. Lässig, Phys. Rev. Lett. 94, 138103 (2005).
[11] P. Harremoës and F. Topsøe, Entropy 3, 1 (2001).
[12] C. E. Shannon, Bell Syst. Tech. J. 27, 379 (1948).
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[14] G. K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.