Parametrized Stochastic Grammars for RNA Secondary Structure Prediction Robert S. Maier Departments of Mathematics and Physics University of Arizona 7 Tucson, AZ 85721, USA 0 Email: [email protected] 0 2 n Abstract—We propose a two-level stochastic context-free distinct HMMs for generation of symbols in {A,T,G,C}. a grammar (SCFG)architectureforparametrized stochasticmod- For ease of HMM parameter estimation, and for finding the J eling of a family of RNA sequences, including their secondary most probable parse, or path through the model (e.g., by 4 structure. A stochastic model of this type can be used for the Viterbi algorithm), the length of each CpG island and 2 maximum a posteriori estimation of the secondary structure of anynewsequenceinthefamily.TheproposedSCFGarchitecture non-CpG region should be modeled in a Markovian way, as ] modelsRNA subsequencescomprisingpaired basesas stochasti- the first hitting time in a finite-state Markov chain. That is, M cally weighted Dyck-language words, i.e., as weighted balanced- on N = {0,1,2,...}, the set of possible lengths, it should parenthesis expressions. The length of each run of unpaired have a phase-type distribution [11], [12]. There is a theorem B bases, forming a loop or a bulge, is taken to have a phase-type of the author’s on such distributions [8], which grew out of . distribution: that of the hitting time in a finite-state Markov o chain. Without loss of generality, each such Markov chain can results on positively weighted regular sequences [4], [15]. It i b be taken to have a bounded complexity. The scheme yields an saysthatwithoutlossofgenerality,thestructureoftheMarkov - overall family SCFGwith a manageable number of parameters. chaincanbegreatlyrestricted:its‘cyclicity’canberequiredto q beatmost2.ThishasimplicationsforHMMparametrization. [ The generating function G(z) of a phase-type (PH) dis- 1 I. INTRODUCTION tribution on N (which is a normalized R -weighted regular + v In biological sequence analysis, probability distributions language over a 1-letter alphabet) is a rational function of z. 6 overfinite(1-dimensional)sequencesofsymbols,representing Goingbeyondregularlanguagestothecontext-freecaseyields 3 0 nucleotides or amino acids, play a major role. They specify an algebraic generating function: one of several variables, 1 the probability of a sequence belonging to a specified family, if each type of letter in the sequence is separately kept 0 andareusuallygeneratedbyMarkovchains.Theseincludethe track of. In RNA secondary structure prediction, stochastic 7 stochastic finite-state Moore machines called hidden Markov context-free grammars (SCFGs), usually in Chomsky normal 0 models (HMMs); or infinite-state Markov chains such as form, have been used [14]. They tend to be complicated; if / o stochastic push-down automata (SPDAs). By computing the the grammar has k non-terminal symbols, then it may have i b mostprobablepaththroughtheMarkovchain,onecananswer O(k3) transition probabilities, which must be estimated from - such questions as “What hidden (e.g., phylogenetic) structure training sequences [7]. What is needed is a class of SCFGs q does a sequence have?”, and “What secondary structure will with (i) restricted internal structure, (ii) equivalent modeling : v a sequence give rise to?”. The number of Markov model power, and (iii) computationally convenient parametrization. i X parameters should ideally be kept to a minimum, to facilitate Findingsuchaclass ofmodelsisa hardproblem:evenonthe r parameter estimation and model validation. levelof1-letter-alphabet(i.e.,univariate)generatingfunctions, a The a priori modeling of an RNA sequence family is con- it involves the constructive theory of algebraic functions. sidered here.Due to Watson–Crick base pairing,a recursively In Section III, as a small step toward solving this problem, structured RNA sequence will fold, and display secondary it is pointed out that there is a class of probability distri- structure. To model stochastically both pairings and runs of butions on N with generating functions (i.e., z-transforms) unpaired bases (which form loops and bulges), results from thatarealgebraicandnon-rational,whichcanbeconveniently a subfield of formal language theory, the structure theory parametrized.Thisistheclassofalgebraichypergeometricdis- of weighted strings [6] (each string being weighted by an tributions. E.g., the N-valued random variable τ could satisfy element of a specified ‘semiring’ such as R ), are reviewed ∞ znPr(τ =n)∝ F (a,b;c;z), where F (a,b;c;·) is + Pn=0 2 1 2 1 and employed in stochastic model construction. Gauss’s (parametrized) hypergeometric function. If a,b,c are In Section II, durationmodelingis discussed: the modeling suitablychosen,n7→Pr(τ =n)will bea probabilitydensity of a probability distribution on ‘runs’, i.e., on the natural function with an algebraic z-transform. Algebraic hypergeo- numbers N. A non-RNA biological example is the modeling metric probability densities satisfy nice recurrence relations, and prediction of CpG islands in a DNA sequence. A se- and SCFG interpretations for them can be worked out. quence may flip between CpG and non-CpG states, with Amoregeneralapproachtowardsolvingtheaboveproblem, notrestrictedtothecaseofa1-letteralphabet,employsSCFGs II. DURATIONMODELING withatwo-levelstructure.Ineffect,theseareSCFGswrapped Since loops and bulges in RNA secondary structure com- around HMMs. The following is an illustration. A probabilis- prise runs of unpaired nucleotides, they can be modeled ticallyweightedDycklanguageoverthealphabet{a,b},i.e.,a without taking long-range covariation into account. The ap- distributionoverthewordsin{a,b}∗thatcomprisenesteda–b propriate stochastic model is an HMM with absorption, since pairs,isgeneratedfromasymbolS byrepeatedapplicationof the accurate modeling of run lengths is a goal. Any such theproductionruleS 7→p1·ab+p2·abS+p3·aSb+p4·aSbS. HMMwillspecifyaprobabilitydistributiononthesetoffinite Theprobabilitiespi sumto1.Ifeachofa,binturnrepresents strings Σ∗, where Σ = {A,U,G,C} is the alphabet set, and a weighted regular language over some alphabet Σ (e.g., a long words are exponentially unlikely. There should be little PH-distribution if Σ has only one letter), then the resulting changeinthenucleotidedistributionalongtypicalruns,sothe distribution over words in Σ∗ comes from a SCFG with distribution of the string length τ ∈N is what is important. the stated two-level structure. This setup is familiar from Thetimeτ toreachafinal(absorbing)stateinanirreducible (unweighted)languagetheoryappliedto compilation:thetop- discrete-timeMarkovchain on a state space Q={1,...,m}, levelstructureofaprogramisspecifiedasawordinacontext- with a transition matrix T = (T )m that is substochastic ij i,j=1 freelanguage,andislandsoflow-levelstructure(e.g.,identifier (i.e., m T ≤ 1), and an initial state vector α = (α )m Pj=1 ij i i=1 names and arithmetic literals) as words in regular languages. that is also substochastic (i.e., m α ≤ 1), is said to have Pi=1 i In Section IV, it is indicated how the idea of a SCFG a discrete PH-distribution. The substochasticity of T and α wrapped around HMMs can be applied to RNA structure expresses the absorption of probability, since they can be prediction:initially,totheparametricstochasticmodeling,ina extended to a larger state space Q˜ =Q∪{m+1}, on which given sequence family, of the recursive primary structure that they will be stochastic. The added state m+1 is absorbing. induces secondary folding. The goal is parameter estimation There is a close connection between PH-distributions and and model validation, by comparison with data on real RNA finite automata theory, in particular the theory of rational sequences.KnudsenandHein[5]andNebel[10]haveworked series over semirings [6]. If A is a semiring (a set hav- on this, using Dyck-like languages, but stochastic modeling ing binary addition and multiplication operations, ⊕ and ⊙, usingdistinctSCFGandHMMlevelsisasignificantadvance. each with an associated identity element; but not necessar- ily having a unary negation operation), then an A-rational On the level of primary RNA structure, paired nucleotides sequence, a = (a )∞ ∈ AN, is a sequence of the form will make up a subsequence of the full nucleotide sequence, n n=0 ⊕m [u ⊙(Mn) ⊙v ], where for some m > 0, M ∈ and must constitute a Dyck word, for simplicity written as i,j=0 i ij j Am×m and u,v∈Am. It is an A-weighted regular language a word over {a,b}. A distribution over the infinite family over a 1-letter alphabet. Semirings of interest here include R, of such Dyck words is determined by the above stochastic R ={x∈R|x≥0},andtheBooleansemiringB={0,1}. production rule, the parameters p ,p ,p ,p in which are + 1 2 3 4 specific to the family being modeled. The productionrule for Theorem 2.1 ([9]): Any PH-distribution on N is an R+- full sequences, including unpaired nucleotides, will have not rational sequence. Any summable R+-rational sequence, if ab, abS, aSb, aSbS on its right-hand side, but rather IaIbI, normalized to have unit sum, becomes a PH-distribution. IaIbS, IaSbI, IaSbS, where each I expands to a ‘run’ of If τ ∈ N is PH-distributed, it is useful to focus on its z- unpaired nucleotides. If the four nucleotides are treated as transform, i.e., G(z) = E[zτ] = ∞ znPr(τ = n). This Pn=0 equally likely in this context, each I will be a stochastic will be a rational function, in R+(z). If the distribution is language over a 1-letter alphabet, and the length of each run finitely supported, it will be a polynomial, in R+[z]. is reasonably modeled as having a PH-distribution. The PH Theorem 2.2 ([9]): Any PH-distribution on N can be gen- classincludesgeometricdistributions,butismoregeneral.The erated from finitely supported distributions by repeated appli- overallSCFGisobtainedbywrappingtheDyckSCFGaround cations of (i) the binary operation of mixture, i.e., G ,G 7→ 1 2 the finite-state Markov chains that yield the PH-distributions. pG +(1−p)G , where p ∈ (0,1), (ii) the binary operation 1 2 From a given family of RNA sequences, Dyck SCFG of convolution,i.e., G1,G2 7→G1G2, and (iii) the unary ‘ge- parameters can be estimated, e.g., by the standard Inside– ometric mixture’ operation, i.e., G 7→ (1−p)P∞k=0pkGk = Outside Algorithm [7]; and then HMM parameters (i.e., PH- (1−p)/(1−pG), where p∈(0,1). distribution parameters) can be estimated separately. By em- ThisisavariantoftheKleene–Schu¨tzenbergerTheoremon ploying a large enough class of PH-distributions, it should be the A-rational series associated to A-finite automata [6]. The possible to produce a better fit to data on secondary structure Boolean(A=B)caseoftheirtheoremisfamiliarfromformal thanwereobtainedfromthefew-parametermodelsofKnudsen languagetheory:itsaysthatanyregularlanguageoverafinite and Hein [5] and Nebel [10]. Once the family has been alphabet can be generated from finite languages by repeated modeled,themostlikelyparsetreeforanynewRNAsequence applications of (i) union, (ii) concatenation, and (iii) the so- in the family can be computed by maximum a posteriori calledKleenestaroperation.Justasinformallanguagetheory, estimation, using the CYK algorithm [14]. The sequence is the third operation of Theorem 2.2 can be implemented on predicted to have the secondary structure represented by that the automaton level by adding ‘loopback’, or cycle-inducing, parse tree. transitions from final state(s) back to initial state(s). Theorem 2.3 ([8]): The unary–binary computation tree III. ALGEBRAIC SEQUENCES leading to any PH-distribution on N, the leaves of which are Most work on RNA secondary structure prediction that finitelysupporteddistributions,canberequiredwithoutlossof draws on formal language theory has employed SCFGs [5], generality to have at most 2 unary ‘geometric mixture’ nodes [10], [14]. A CFG in Chomsky normal form (CNF), used for along the path extending to the root from any leaf. That is, generating strings in Σ∗ where Σ is a finite alphabet set, is a those operations do not need to be more than doubly nested. set of production rules of the form V 7→ W W or V 7→ a, 1 2 where V,W ,W are elements of a set V of ‘variables’, i.e., 1 2 This is a normalized, or ‘stochastic’, version of a result on nonterminalsymbols,anda∈Σ.Thereisadistinguishedstart the representation of R+-rational sequences [4], [15]. Results symbol S ∈ V with which the process begins. Applying the of this type are stronglysemiring-dependent.It is notdifficult production rules repeatedly yields a subset L ⊂ Σ∗, i.e., a to see that in the cases A = R and B, the analogue of language.AnSCFG assignsprobabilities(whichaddtounity) the number ‘2’ is ‘1’. (This is because, e.g., a B-rational to the productions of each V ∈ V, and yields a probability sequence is simply an sequence in BN that is eventually distribution over the strings in L⊂Σ∗, i.e., over Σ∗. periodic.)TheproofofTheorem2.3isanexplicitconstruction, The probability distribution P : Σ∗ → [0,1] ⊂ R + whichrespectspositivityconstraintsateachstage.Thatis, the producedbyanSCFGisanexampleofanR -algebraicseries. + construction solves the representation problem for univariate In general, if A is a semiring, an A-algebraic series (of CNF PH-distributions,which hasstrongconnectionsto the positive type) over an alphabet Σ is a weighting function f: Σ∗ →A realization problem in control theory [1]. obtained as one component (i.e., the component f ) of the S Whatthetheoremsays,sinceoperationsoftypes(i),(ii),(iii) formal solution of a coupled set of quadratic equations correspond to parallel composition, serial composition, and cyclic iteration of Markov chains, is that any PH-distribution fV = X cV;W1,W2fW1fW2 +XcV;aa, V ∈V, ariseswithoutlossofgeneralityfromaMarkovchaininwhich W1,W2∈V a∈Σ cycles of states are nested at most 2 deep. That is, the chain computedbyiteration[6].ThecoefficientscV;W1,W2 andcV;a may include cycles, and cycles within cycles, but not cycles are elements of A, so each fV is a sum of A-weighted within cycles within cycles. So for modeling purposes, the strings in Σ∗, or equivalently a function fV : Σ∗ → A. chain transition matrix T may be taken to have a highly It is clear that Theorem 2.1 has an analogue: any probability restricted structure. A completely connected transition graph distributiononΣ∗ producedbyaSCFGofCNFtypeissimply onastatespaceofsizemwouldhavem2 possibletransitions, a normalized R+-algebraic series. AnySCFGofCNFtypehas|V|3+|V||Σ|parameters,which and would be unnecessarily general when m is large. may be too many for practical estimation if a small sequence Unpaired nucleotide run lengths in RNA are naturally family is being modeled. To facilitate modeling, one should modeledashaving(discrete-time)PHdistributionsbecauseof useanSCFGwitharestrictedstructure,andalsoexploitresults the close connectionwith HMMs, and the consequentease of from weighted automata theory. If the nucleotide distribution parameter estimation. However, the class of PH distributions doesnotvarymuchalongtypicalsequences,thenthealphabet is so versatile that it would be useful in this context, regard- set Σ can be taken to be a 2-letter alphabet {a,b} (if one less. Discrete PH distributionsincludegeometricandnegative is modeling Watson–Crick pairing, exclusively) or even a 1- binomial distributions, and are dense (in a suitable sense) in letter alphabet (if one is modeling runs of unpaired bases). the class of distributions Pr(τ = n), n ∈ N, which have Also,onecanleveragethefactthatA-algebraicseriessubsume leading-order geometric falloff as n→∞. A-rational series, which implies (in the 1-letter case) that AnyPH distributiononN hasa z-transformG(z)=E[zτ] A-algebraic sequences, which are effectively indexed by N, that is rational in the conventional sense; equivalently, it subsume A-rational sequences. In the Boolean (A=B) case, must satisfy a finite-depth recurrence relation of the form the first statement is the familiar Chomsky hierarchy. N c Pr(τ = n + k) = 0. In fact, any probability Inthecaseofa1-letteralphabetΣ={a},anSCFGdefines Pk=0 k distribution on N with (i) a rational z-transform G(z), and a probability distribution on N ∼= {a}∗. That is, it defines an (ii) the property that the pole which G(z) necessarily has N-valuedrandomvariableτ,thelengthofthestringemittedby atz =1 is theonly poleonthe circle |z|=1,is necessarilya thestochasticpush-downautomaton(SPDA)correspondingto PH distribution [12]. This is a sort of converseof the Perron– theSCFG.TheSPDAusesV, thesetofnonterminalsymbols, Frobenius Theorem. However, there are distributions on N as its stack alphabet, and its stack is initially occupied by the whichsatisfy(i)butnot(ii),andarenotPHdistributions.They start symbol S. The stochastic production rules specify what are necessary R-rational sequences, but are not R -rational happenswhena symbolV ∈V is poppedoffthe stack:either + sequences as defined above, even though they are sequences two symbols W ,W ∈ V are pushed back, or a letter ‘a’ is 1 2 of elements of R (probabilities).In abstract-algebraicterms, emitted.Byconstruction,atleastonelettermustbeemittedby + thissituationispossiblebecausethesemiringRisnotaFatou aCNF-typeSCFGbeforeitsstackempties,soPr(τ =0)=0. extension of the semiring R [6]. The existence of patholog- The class of probability distributions on N associated to + ical examples of this type does not vitiate the usefulness of SCFGs (whether or not of CNF type), i.e., that of normalized PH distributions in run-length modeling. R -algebraic sequences, is potentially useful in parametric + stochastic modeling, but has not been widely employed. It IV. MODELING SECONDARY STRUCTURE will be denoted F here, since each distribution in it has an algebraic z-tranaslfgorm G(z) = ∞ znPr(τ = n). For A new scheme for modeling the prior distribution of sec- Pn=0 ondarystructuresinanRNAsequencefamilywillnowbepro- any SCFG, an algebraic equation satisfied by G(z) can be posed.ItwillexploittheinsightsofSectionsIIandIII,onthe computed by polynomial elimination (e.g., by computing the classofdiscretephase-typedistributionsonN(i.e.,PH ),and resultantoftheabovesystemofquadraticequations).LetPH d d thelargerclassofR -algebraicdistributionsonN (i.e.,F ). denote the class of discrete phase-type distributions. + alg If Σ = {A,U,G,C} is the alphabet set, any SCFG, or its Theorem 3.1: (i) PHd ⊂Falg. (ii) If X,Y areindependent associated SPDA, will definea probabilitydistributionon Σ∗, N-valued random variables (RVs) with distributions in PHd, the set of finite length sequences [14]. (The distribution of then conditioning on X = Y yields an RV with distribution thesequencelength,whichis a randomvariable,lies in F .) alg inFalg.(iii)If,furthermore,Z isanindependentN-valuedRV But evenif the SCFG is in Chomskynormalform (CNF), the with distribution in Falg, then conditioning on X =Z yields number of its parameters grows cubically in the number of an RV with distribution in Falg. grammar variables, as mentioned above. To facilitate estima- tion, the model should have a restricted structure. These are ‘normalized’ versions of standard facts on A- rational and A-algebraic series, in particular on their compo- The models of Knudsen and Hein [5] and Nebel [10] are sition under the Hadamard product (x ),(y ) 7→ (x y ), in representative.TheKnudsen–HeinSCFGhasvariablesetV= n n n n the special case when A = R and Σ = {a}. (See [3], [6].) {S,L,F}, and production rules + They have direct probabilistic proofs. E.g., to prove (i), one would show that starting from the distribution of τ ∈ N, the S 7→LS |L, i.e., S 7→L+ =△ L|L2 |L3 |··· , absorption time in a HMM, one can construct an SCFG that L7→s|a Fb , 1 1 yieldsthe same distributionon N. (IfPr(τ =0)>0 then the F 7→a Fb |LS, i.e., F 7→a Fb |LL+. 2 2 2 2 SCFG cannot be of CNF type.) The procedure is similar to constructing a PDA that accepts a specified regular language. Here s signifies an unpaired base and a ...b signifies two i i Much as with discrete PH distributions, it is difficult to bases that are paired in the secondary structure, so L+ pro- parametrize distributions in the class F without, rather duces runs of unpaired bases, i.e., loops (which may include alg explicitly, parametrizing the stochastic automata (SCFGs or stems)andF producesrunsofpairedbases,i.e.,stems(which SPDAs) that give rise to them; or at least their z-transforms. may include loops of length at least 2). This SCFG is not It is difficult, in general, to characterize when a probability a CNF one, but model parameters may be estimated by a distributiononNthathasanalgebraicz-transformliesinF . variant of the Inside–Outside algorithm. If one takes single alg The following example illustrates the problem. Any distri- base frequencies and pair frequencies (i.e., the probability of butionn7→Pr(τ =n)onNthathasanalgebraicz-transform ai...bi representingA–U, G–C, or even G–U) into account, necessarily satisfies a finite-depth recurrence of the form onehasonlythreeindependentparameterstobeestimated,one N C (n)Pr(τ = n+k) = 0, where the functions C , probabilityperproductionrule.KnudsenandHein (cf.Nebel) Pk=0 k k k =0,...,N,arepolynomialinn.(IfnoneoftheC depends used as their primary training set a subset of the European k on n, then the z-transform will be rational.) Consider, for database of long subunit ribosomal RNAs (LSU rRNAs) [2], example, the 2-term recurrence [16].FortheprobabilitiesofLSvs.L(fromS),theyestimated 87% vs. 13%; for s vs. a Fb (from L), 90% vs. 10%; and 1 1 (n+a)(n+b)Pr(τ =n)=(n+c)(n+1)Pr(τ =n+1), for a2Fb2 vs. LS (from F), 79% vs. 21%. Their training set actually included tRNAs as well, since they were attempting where a,b,c ∈ R are parameters, which is of this form. to model the family of folded RNA molecules as a whole. The z-transform G(z) = ∞ znPr(τ = n) of its solu- As Knudsen and Hein note, their model yields loops and Pn=0 tion is proportional, by definition, to F (a,b;c;z), which stemswithgeometricallydistributedlengths.Toimprovequan- 2 1 is Gauss’s parametrized hypergeometric function. The set of titative agreement, it would need to be made more sophisti- triples (a,b;c) ∈ R3 that yields an algebraic z-transform, cated.Itwouldalsobenefitfroma cleanerseparationbetween and hence an R-algebraic sequence n 7→ Pr(τ = n), is itstwolevels:thepaired-baseandunpaired-baselevels,i.e.,the explicitly known. It was derived in the nineteenth century by context-freeand regularlevels(in the formallanguagesense), H.A.Schwartz[13,Chap.VII].Unfortunately,itisaninfinite i.e., the SPDA and HMM levels (in the stochastic automata- discrete subset of R3, not a continuous subset. theoretic sense). The above production rules couple the two In general, the z-transform of the solution of a finite- levels together. It is not clear from Ref. [5] how well the depth recurrence of the above form will be algebraic in z model stochastically fits the length of (i) training sequences, only if the overall parameter vector of its coefficients, the (ii) the subsequences comprising paired bases, and (iii) the polynomials {C (n)}N , is confined to a submanifold of subsequences comprising unpaired bases. Separating the two k k=0 positive codimension. For distributions in F , this makes levels should facilitate the separate fitting of these quantities. alg recurrence-basedparametrizationlessusefulthanSCFG-based By definition, folded RNA secondary structure is charac- or z-transform-based parametrization. terized by a subsequence comprising paired bases, so the stochastic modeling of secondary structure in a given fam- will be the absorption time in a finite-state Markov chain, ily should initially focus on such subsequences. If pseudo- the parameters of which, i.e., transition probabilities, can be knots (a thorny problem for automata-theoreticmodeling) are estimated from empirical data. Geometric distributions, and ignored, these subsequences are effectively Dyck words, or generalizations, are appropriate. It follows from Theorem 2.3 balanced parenthesis expressions. In the absence of covari- that employing a large Markov chain with a fully connected ation, one expects to be able to generate such words over transition graph, and hence a number of parameters that {A,U,G,C} from classical Dyck words over the 2-letter grows quadratically in the number of states, would not be alphabet{a,b},consistingofopeningandclosingparentheses, appropriate. Without loss of generality, each transition graph by replacing each a–b pair independently by an A–U, G–C, canbeassumedtohaveno‘cycleswithincycleswithincycles’. orG–U pair,accordingtoobservedpairfrequencies.Knudsen In this extended (two-level) stochastic model of the sec- andHeinnotethatordermatters:inG–C/C–GpairsintRNA, ondary structure of a family of RNA sequences, the sequence the G tends to be nearer the 5′ end of the RNA than the C. length distribution still lies in F . That is because a (Dyck- alg Stilltoberesolved,ofcourse,istheselectionoftheunderlying type) SPDA wrapped around one or more HMMs is still probability distribution over Dyck words in {a,b}∗. an SPDA, with an SCFG representation. This observation is OnecouldstartwithanyCFGthatunambiguouslygenerates similar to the proof of Theorem 3.1: what has been con- theDyckwordsover{a,b},andmakeitstochasticbyweight- structed here is simply an SCFG with a special structure, ingitsproductions.ThesimplestsuchCFGisa1-variableone, not given explicitly in Chomsky normal form. The full set of S 7→ab|abS |aSb|aSbS. The corresponding SCFG is model parameters could be estimated by the Inside–Outside algorithm[7], rather than by estimating Dyck-SCFG and run- S 7→p ·ab+p ·abS+p ·aSb+p ·aSbS, 1 2 3 4 lengthparametersseparately;butthatwouldnotbesoefficient. where p = 1. This SCFG, with 3 free parameters, is The test of the proposed SCFG architecture will be its Pi i so simple that it can be studied analytically. The length of value in secondary structure prediction, since from any RNA a Dyck word is an N-valued randomvariable, the distribution sequence the most likely parse tree, and paired-base subse- of which lies in F , with a parameter-dependent, algebraic quence,canbecomputedbymaximumaposterioriestimation. alg z-transform. As was explained in Section III, it is best to REFERENCES parametrize distributions in F by the SCFGs that give rise alg [1] C.CommaultandS.Mocanu,“Phase-typedistributionsandrepresenta- to them, rather than by explicit formulas or even by the tions:Someresultsandopenproblemsforsystemtheory,”Int.J.Control, recurrence relations that they satisfy; and this is an example. vol.76,no.6,pp.566–580, 2003. This Dyck model could be made arbitrarily more versatile, [2] P.deRijk, A.Caers,Y.vandePeer, andR.deWachter, “Database on thestructureoflargeribosomalsubunitRNA,”NucleicAcidsResearch, since arbitrarily complicated CFGs that generate the Dyck vol.26,no.1,pp.183–186, 1998. language over {a,b} can readily be constructed. One could, [3] M. Fliess, “Sur divers produits de se´ries formelles,” Bull. Soc. Math. for instance, iterate S 7→ ab | abS | aSb | aSbS once, France,vol.102,pp.181–191,1974. [4] T. Katayama, M. Okamoto, and H. Enomoto, “Characterization of obtaining a production rule for S with 25 alternatives on its the structure-generating functions of regular sets and the D0L growth right-handside.Weightingthemwithprobabilitieswouldyield functions,” Inform.andControl, vol.36,no.1,pp.85–101,1978. an SCFG with 24 independent parameters, which would be [5] B. Knudsen and J. Hein, “RNA secondary structure prediction using stochasticcontext-freegrammarsandevolutionaryhistory,”Bioinformat- capableof muchmore accuratefitting of data on an empirical ics,vol.15,no.6,pp.446–454,1999. family of RNA sequences. In general, one could choose [6] W. Kuich and A. Salomaa, Semirings, Automata, Languages. New model parameters to fit not only the observed distribution of York/Berlin: Springer-Verlag, 1986. [7] K. Lari and S. J. Young, “The estimation of stochastic context-free Dyck word lengths (i.e., per-family paired-base subsequence grammars using the Inside–Outside algorithm,” Computer Speech and lengths),butalsothedistributionoflengthsofstems,i.e.,runs Language, vol.4,no.1,pp.35–56,1990. ofcontiguouspairedbases,whichmaybefarfromgeometric. [8] R.S.Maier,“Phase-typedistributionsandthestructureoffiniteMarkov chains,”J.Comp.Appl.Math.,vol.46,no.3,pp.449–453,1993. The preceding discussion of Dyck words formed from [9] R.S.MaierandC.A.O’Cinneide,“Aclosurecharacterizationofphase- paired bases ignored loops, i.e., runs of unpaired bases. They typedistributions,” J.Appl.Probab.,vol.29,no.1,pp.92–103,1992. are besthandledona second levelof the SCFG. Thesimplest [10] M.E.Nebel,“Investigation oftheBernoullimodelforRNAsecondary structure prediction,” Bull.Math.Biol.,vol.66,pp.925–964, 2004. production rule for full sequences would have not S, abS, [11] M.F.Neuts,Matrix-Geometric Solutions inStochastic Models. Balti- aSb, aSbS on its right-hand side, but rather IaIbI, IaIbS, more,Maryland: JohnsHopkinsUniversity Press,1981. IaSbI,IaSbS,whereeachoftheeightIsexpandstoarunof [12] C.A.O’Cinneide,“Characterizationofphase-typedistributions,”Comm. Statist. Stochastic Models, vol.6,no.1,pp.1–57,1990. unpaired bases. In the absence of covariation, modeling each [13] E. G. C. Poole, Linear Differential Equations. Oxford: Oxford run is a matter of duration modeling. Starting from an N- University Press,1936. valued random run length, or equivalently a distribution over [14] Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjo¨lander, R. C. Underwood, and D. Haussler, “Stochastic context-free grammars for finite 1-letter words, one would generate a run of unpaired tRNA modeling,” Nucleic Acids Research, vol. 22, no. 23, pp. 5112– bases by replacing each letter independently by A,U,G,C, 5120,1994. according to family-specific single base frequencies. [15] M.Soittola,“Positiverationalsequences,”Theoret.Comput.Sci.,vol.2, no.3,pp.317–322,1976. Each run length is naturally taken to have a distribution [16] J.Wuyts,P.deRijk,Y.vandePeer,T.Winkelmans,andR.deWachter, in PHd, since that will allow the resulting run of bases to “The European large ribosomal subunit RNA database,” Nucleic Acids be generated by an HMM (with absorption). Each run length Research,vol.29,no.1,pp.175–177,2001.