Hyperbolic statistics and entropy loss in the description of gene distributions

K. Lukierska-Walasek, Faculty of Mathematics and Natural Sciences, Cardinal Stefan Wyszyński University, Dewajtis 5, 01-815 Warsaw, Poland
K. Topolski, Institute of Mathematics, Wroclaw University, Pl. Grunwaldzki 2/4, 50-384 Wroclaw, Poland
K. Trojanowski, Institute of Computer Science, Polish Academy of Science, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland

arXiv:1401.4561v2 [physics.bio-ph] 28 Mar 2014

Zipf's law implies statistical distributions of hyperbolic type, which can describe the properties of stability and entropy loss in linguistics. We present the information theory from which it follows that if a system is described by distributions of hyperbolic type, entropy loss becomes possible. We present the numbers of repetitions of genes in the genomes of some bacteria, such as Borrelia burgdorferi and Escherichia coli. The distributions of gene repetitions in a genome appear to be represented by distributions of hyperbolic type.

Keywords: gene length, hyperbolic distributions, entropy

1. Introduction. In recent years it has become possible to obtain genome sequence data for many organisms. The genome has been studied intensively by a number of different methods [1]-[10]. The statistical analysis of DNA is complicated by its complex structure: it consists of coding and non-coding regions, repetitive elements, etc., which influence the local sequence decomposition. The long-range correlations in the sequence composition of DNA are much more complex than a simple power law; moreover, the effective exponents of the scaling regions vary considerably between species [9], [10]. In papers [5], [6] the Zipf approach to analyzing linguistic texts was extended to the statistical study of DNA base pair sequences.

In our paper we take some linguistic features of genomes into account to study the statistics of gene lengths. We present the information theory from which it follows that if a system is described by a special distribution of hyperbolic type, entropy loss becomes possible. Distributions of hyperbolic type also describe the properties of stability and flexibility, which explain how the language considered in paper [11] can develop without changing its basic structure. A similar situation can occur in genome sequence data, which carry the genetic information. We can expect the above features in gene strings, as in language strings (words, sentences), because of the presence of redundancy in both cases.

In Sect. 2 we present some common features of linguistics and genetics. In Sect. 3 we describe the Code Length Game, the Maximal Entropy Principle, and the theorem on distributions of hyperbolic type and entropy loss. In Sect. 4 we apply these ideas to genetics, and Sect. 5 contains final remarks. The distributions we find appear to be hyperbolic in the sense of that theorem.

2. Some common features of linguistics and genetics. A language is characterized by some alphabet with letters a, b, c, ..., which form words as sequences of n letters. In quantum physics the analogue of letters can be attached to pure states, while texts correspond to mixed general states. Languages have very different alphabets: computers use 0 and 1 (two bits), the English language 27 letters including the space, and DNA the four nitric bases G (guanine), A (adenine), C (cytosine) and T (thymine). A collection of letters can be ordered or disordered. To quantify the disorder of different collections of letters we use the entropy

H = -\sum_i p_i \log_2 p_i,   (1)

where p_i denotes the probability of occurrence of the i-th letter. Taking logarithms to base 2 gives the entropy in bits. When all letters have the same probability, the entropy obviously attains its maximum value H_max.

The real entropy has a lower value H_eff, because in real languages the letters do not appear with equal probability. The redundancy R of a language is defined [11] as

R = (H_max - H_eff) / H_max.   (2)

The quantity H_max - H_eff is called the information.
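To make Eqs. (1) and (2) concrete, here is a minimal Python sketch (not from the paper) that computes the entropy in bits and the redundancy for a four-letter DNA alphabet; the base frequencies below are illustrative assumptions, not measured values.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of Eq. (1), in bits; zero-probability letters contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy(probs):
    """Redundancy R of Eq. (2): relative gap between H_max and the effective entropy."""
    h_max = math.log2(len(probs))   # uniform distribution maximizes the entropy
    h_eff = entropy_bits(probs)
    return (h_max - h_eff) / h_max

# Illustrative (assumed, not measured) base frequencies for an AT-rich genome:
dna = {"G": 0.15, "A": 0.35, "C": 0.15, "T": 0.35}
print(entropy_bits(dna.values()))       # about 1.88 bits, below H_max = 2
print(redundancy(list(dna.values())))   # about 0.06
```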
The information depends on the difference between the maximum entropy and the actual entropy: the larger the actual entropy, the smaller the redundancy. Redundancy can be measured through the frequencies with which different letters occur in one or more texts. The redundancy R is the number such that if we remove the fraction R of the letters, as determined by the redundancy, the meaning of the text is still understood. In English some letters occur more frequently than others, and similarly in the DNA of vertebrates the nitric bases C and G are usually less frequent than A and T. A low value of redundancy makes it easier to combat transcription errors in the genetic code. In papers [5], [6] it is demonstrated that non-coding regions of eukaryotes display a smaller entropy and a larger redundancy than coding regions.

3. Maximal Entropy Principle and Code Length Game. In this section we provide some mathematics from information theory which is helpful in the quantitative formulation of our approach; for details see [11].

Let A be the alphabet, a discrete set which is finite or countably infinite. Let M^1_+(A) and ∼M^1_+(A) be, respectively, the set of probability measures on A and the set of non-negative measures P such that P(A) ≤ 1. The elements of A can be thought of as letters. By K(A) we denote the set of mappings (compact codes) k : A → [0, ∞] which satisfy Kraft's equality [13]

\sum_{i \in A} \exp(-k_i) = 1.   (3)

By ∼K(A) we denote the set of all mappings (general codes) k : A → [0, ∞] which satisfy Kraft's inequality [13]

\sum_{i \in A} \exp(-k_i) \le 1.   (4)

For k ∈ ∼K(A) and i ∈ A, k_i is the code length, for example the length of a word.
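As a quick illustration of Kraft's equality (3) and inequality (4), the sketch below (a toy of ours, not code from [13]) checks the Kraft sum for a compact code built from a probability distribution via k_i = -ln p_i, the correspondence introduced just below, and for a general code obtained by lengthening its codewords.

```python
import math

def kraft_sum(code_lengths):
    """Sum of exp(-k_i) over the alphabet: exactly 1 for compact codes (Eq. 3),
    at most 1 for general codes (Eq. 4). Lengths are in nats to match exp(.)."""
    return sum(math.exp(-k) for k in code_lengths)

p = [0.5, 0.25, 0.125, 0.125]            # any toy probability distribution
compact = [-math.log(pi) for pi in p]    # k_i = -ln p_i
print(kraft_sum(compact))                # 1.0 -> Kraft's equality (3)

general = [k + 0.3 for k in compact]     # longer codewords shrink the sum
print(kraft_sum(general) <= 1.0)         # True -> Kraft's inequality (4)
```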
For k ∈ ∼K(A) and P ∈ M^1_+(A) the average code length is defined as

<k, P> = \sum_{i \in A} k_i p_i.   (5)

There is a bijective correspondence between p_i and k_i, namely k_i = -\ln p_i and p_i = \exp(-k_i).

For P ∈ M^1_+(A) we also introduce the entropy

H(P) = -\sum_{i \in A} p_i \ln p_i.   (6)

The entropy can be represented as the minimal average code length (see [11]):

H(P) = \min_{k \in ∼K(A)} <k, P>.

Let P ⊆ M^1_+(A); then

H_max(P) = \sup_{P \in P} \inf_{k \in ∼K(A)} <k, P> = \sup_{P \in P} H(P) \le \inf_{k \in ∼K(A)} \sup_{P \in P} <k, P> = R_min(P).   (7)

Formula (7) presents the Nash optimal strategies: R_min denotes the minimum risk and k the Nash equilibrium code, while P denotes a probability distribution. For example, in linguistics the listener is a minimizer and the speaker a maximizer: we have words with distributions p_i and their codes k_i, i = 1, 2, ...; the listener chooses the codes k_i and the speaker chooses the probability distributions p_i. We can note that Zipf argued [14] that in the development of a language a vocabulary balance is reached as the result of two opposing forces: unification, which tends to reduce the vocabulary and corresponds to a principle of least effort seen from the point of view of the speaker, and diversification, connected with the listener's wish to know the meaning of the speech.

The principle (7) is as basic as the Maximum Entropy Principle, in the sense that the search for one type of optimal strategy in the Code Length Game translates directly into a search for distributions with maximum entropy. Given a code k ∈ ∼K(A) and a distribution P ∈ M^1_+(A), the optimal strategy H(P) = \inf_{k \in ∼K(A)} <k, P> is represented by the entropy H(P), while the actual strategy is represented by <k, P>.

Zipf's law is an empirical observation which relates the rank and the frequency of words in natural language [14]. This law suggests modeling by distributions of hyperbolic type [11], because no distribution over N can have probabilities proportional to 1/i, owing to the lack of a normalization condition.

We consider a class of distributions P = (p_1, p_2, ...) over N with p_1 ≥ p_2 ≥ .... P is said to be hyperbolic if, for any given α > 1, p_i ≥ i^{-α} for infinitely many indexes i. As an example we can choose p_i ∼ i^{-1} (\log i)^{-c} for some constant c > 2.

The Code Length Game for a model P ⊆ M^1_+(N), with codes k for which \sum_i \exp(-k_i) = 1, is in equilibrium if and only if H_max(co(P)) = H_max(P). In such a case a distribution P* is the H_max-attractor, that is, P_n → P* for every sequence (P_n)_{n≥1} ⊆ P for which H(P_n) → H_max(P). One expects that H(P*) = H_max(P), but in the case with entropy loss we have H(P*) < H_max(P). Such a possibility appears when the distributions are hyperbolic; this follows from the theorem in [11].

Theorem. Assume that P* ∈ M^1_+(N) is of finite entropy and has ordered point probabilities. Then a necessary and sufficient condition for P* to occur as the H_max-attractor in a model with entropy loss is that P* is hyperbolic. If this condition is fulfilled, then for every h with H(P*) < h < ∞ there exists a model P_h with P* as H_max-attractor and H_max(P_h) = h. In fact, P_h = {P | <k*, P> ≤ h} is the largest such model, where k* denotes the code adapted to P*, i.e. k*_i = -\ln p*_i, i ≥ 1.

As an example we can consider an "ideal" language in which the frequencies of words are described by a hyperbolic distribution P* with finite entropy. At a certain stage of life one is able to communicate at a reasonably high rate, about H(P*), and to improve the language by introducing specialized words which occur seldom in the language as a whole. This process can be continued during the rest of life: one is able to express complicated ideas and to develop the language without changing its basic structure. A similar situation can be expected in gene strings, which carry the genetic information.

4. Applications to genetics. The gene data which we use were obtained from GenBank (ftp://ftp.ncbi.nih.gov).

FIG. 1: Graph of the number of repetitions of genes in the genome of Escherichia coli. The genome consists of 5874 genes.

FIG. 2: Graph of the number of repetitions of genes in the genome of Escherichia coli. The genome consists of 4892 genes.

FIG. 3: Graph of the number of repetitions of genes in the genome of Borrelia burgdorferi. The genome consists of 1239 genes.

Figures 1, 2 and 3 show how often the same genes are repeated in a genome. For example, Figure 2 shows that the most repeated gene has number 1 and appears 5 times, the second one also appears 5 times, the third gene 4 times, etc. The probabilities of appearance of genes follow distributions of hyperbolic type [11].
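A hedged sketch of the kind of computation behind Figs. 1-3 (our reconstruction, not the authors' published code): it counts how often each gene identifier repeats, forms the rank-ordered empirical probabilities p_1 ≥ p_2 ≥ ..., evaluates the entropy (6), and checks the hyperbolicity condition p_i ≥ i^{-α} of Sect. 3 on the finite data. The input list of gene identifiers is hypothetical, and parsing GenBank files is omitted.

```python
import math
from collections import Counter

def repetition_spectrum(genes):
    """Rank-frequency statistics of gene repetitions, as plotted in Figs. 1-3:
    count occurrences of each gene, then order the counts decreasingly."""
    counts = sorted(Counter(genes).values(), reverse=True)
    total = sum(counts)
    return counts, [c / total for c in counts]   # empirical p_1 >= p_2 >= ...

def hyperbolic_indexes(probs, alpha):
    """1-based indexes i with p_i >= i**(-alpha). A hyperbolic distribution has
    infinitely many such i for every alpha > 1; finite data can only hint at this."""
    return [i for i, p in enumerate(probs, start=1) if p >= i ** (-alpha)]

# Hypothetical input: one identifier per gene occurrence in the genome.
genes = ["geneA", "geneA", "geneB", "geneA", "geneC", "geneB", "geneD"]
counts, probs = repetition_spectrum(genes)
print(counts)                                   # [3, 2, 1, 1]
print(hyperbolic_indexes(probs, alpha=1.5))     # [4] for this toy sample
print(-sum(p * math.log(p) for p in probs))     # entropy H(P) of Eq. (6), in nats
```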
5. Final remarks. Zipf's law relates the rank k and the frequency f of words in natural languages by the exact hyperbolic formula k × f = const. However, probability values proportional to 1/n for all natural numbers n do not satisfy the normalizability condition. This problem does not occur if we introduce distributions of hyperbolic type. In this paper we have followed the statistical description of the structure of natural languages and presented the information-theoretic considerations which yield, as conclusions, the stability property and the possibility of entropy loss. We have applied similar reasoning to the description of genomes, with the genes corresponding to words in languages. We have tested three examples of genomes and described the gene statistics through the number of repetitions in the genome. It appears that the distribution of genes is represented by distributions of hyperbolic type.

One of the authors (K.L-W) would like to thank prof. Franco Ferrari for discussions.

[1] W. Li and K. Kaneko, Europhys. Lett. 17, 655 (1992).
[2] C. K. Peng et al., Nature (London) 356, 168 (1992).
[3] R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992).
[4] C. K. Peng et al., Phys. Rev. E 49, 1685 (1994).
[5] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. Lett. 73, 3169 (1994).
[6] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. E 52, 2939 (1995).
[7] H. E. Stanley et al., Physica A (Amsterdam) 273, 1 (1999).
[8] M. de Sousa Vieira, Phys. Rev. E 60, 5932 (1999).
[9] D. Holste et al., Phys. Rev. E 67, 061913 (2003).
[10] P. W. Messer, P. F. Arndt, and M. Lässig, Phys. Rev. Lett. 94, 138103 (2005).
[11] P. Harremoës and F. Topsøe, Entropy 3, 1 (2001).
[12] C. E. Shannon, Bell Syst. Tech. J. 27, 379 (1948).
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[14] G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.
