
RNA Structure Prediction PDF

303 Pages·2023·17.314 MB·English


Methods in Molecular Biology 2586

Risa Karakida Kawaguchi
Junichi Iwakiri
Editors

RNA Structure Prediction

METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire, Hatfield
Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
RNA Structure Prediction

Edited by

Risa Karakida Kawaguchi
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA

Junichi Iwakiri
Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan

ISSN 1064-3745 ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-2767-9 ISBN 978-1-0716-2768-6 (eBook)
https://doi.org/10.1007/978-1-0716-2768-6

© Springer Science+Business Media, LLC, part of Springer Nature 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface

Since the 1960s, RNA secondary structure analysis has been performed for a wide variety of groups of RNAs, including rRNAs and RNA virus genomes [1], and this has been followed by RNA 3D structure analysis [2]. Owing to the instability of RNA structures, the number of detected structures is much smaller for RNAs than for proteins, specifically 1523 RNA-only structures compared with 151,577 protein-only structures in the Protein Data Bank (PDB) [3] as of 2020. In silico structure prediction is, therefore, a powerful approach to overcoming these limitations and revealing a comprehensive view of RNA structures. The most prevalent and high-performance structure prediction method is based on a thermodynamic model that takes the primary sequence as an input and predicts a representative structure according to its stability. Free energy changes are estimated as a metric of structure stability by approximation of those for each sub-structure using the melting temperature for short oligos or based on structures in databases [4–6]. To compute the probability of RNA secondary structures based on thermodynamic models, it is necessary to compute the sum of the exponential of the free energy changes. This is formulated as $\sum_{\zeta \in \Omega(x)} e^{-dG(\zeta,x)/RT}$, where dG(ζ,x) is the free energy change of sequence x and structure ζ, R is the gas constant, T is the absolute temperature, and Ω(x) represents the set of all possible structures of sequence x. dG(ζ,x) for every possible subsequence x can be efficiently computed by a dynamic programming (DP) technique using the subproblems dG(ζ,x′), where x′ is a subsequence of x. After computing the DP variables obtained for the full sequence, the partition function and probability of the structure can be computed following a canonical distribution. Previous benchmark studies have carried out the comprehensive assessment of RNA secondary and 3D prediction tools with multiple structure databases [7, 8], indicating not only sufficiently high prediction accuracy but also computational limitation in robustness and performance stability for many prediction methods.
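As a minimal sketch of the partition-function computation described above, the following Python code sums the Boltzmann weights e^(-dG/RT) over all pseudoknot-free structures of a sequence with a McCaskill-style DP. The energy model is an assumption made purely for illustration and is not the thermodynamic model used by the tools in this book: every base pair contributes a fixed, hypothetical `pair_energy` of -1 kcal/mol, loops cost nothing, and a hairpin must enclose at least `min_loop` unpaired bases.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K)
# Canonical pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_energy=-1.0, temp=310.15, min_loop=3):
    """Sum of exp(-dG/RT) over all pseudoknot-free structures of seq.

    Toy model (assumption): dG = pair_energy * (number of base pairs).
    """
    n = len(seq)
    w = math.exp(-pair_energy / (R * temp))  # Boltzmann weight of one pair
    # z[i][j] is the partition function of seq[i:j]; empty spans weigh 1.
    z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            total = z[i][last]  # case 1: the last base is unpaired
            # case 2: the last base pairs with k, splitting the span in two
            for k in range(i, last - min_loop):
                if (seq[k], seq[last]) in PAIRS:
                    total += z[i][k] * w * z[k + 1][last]
            z[i][j] = total
    return z[0][n]


if __name__ == "__main__":
    z = partition_function("GGGAAAUCCC")
    print(f"partition function Z = {z:.3f}")
```

Under this toy model, the canonical-distribution probability of one particular structure with p base pairs is `w**p / Z`. The triple loop makes the O(N^3) cost of pseudoknot-free DP visible directly.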
In terms of accuracy and computational time, one of the main limitations of in silico structure analysis is the prediction of long RNAs. For example, the estimation of free energy changes is assumed to show low accuracy for distant base pairs because models are fitted to results obtained from short RNAs. The rapid exponential increase in possible structures relative to sequence length is also a critical problem for feasibility. While the improvement of applicability and performance has been actively studied for in silico structure analysis [9], experimental techniques have also been applied to solve those problems at the same time, for example, high-throughput structure probing methods including SHAPE-seq [10] and DMS-seq [11]. These methods utilize high-throughput sequencing to capture cleavage or modification events introduced by chemical reagents or enzymes as a signature of reactive sites regardless of the length or condition of RNA samples. This is based on the principle that the probability of these events is correlated with the probability of being unpaired in RNA secondary structures. By comparing the determined and predicted accessibilities, both scores have been found to be consistent at a transcriptome-wide level [9]. Owing to the various sources of biological and technical noise, there is a challenge in extracting reproducible results from high-throughput data including such noise [12, 13]. However, the integration of multiple datasets relying on different techniques may uncover the true transcriptome-wide RNA structure landscape by overcoming data-specific biases [14].

Moreover, high-throughput structure probing methods can be applied to a variety of conformational analyses. A careful selection of reagents and experimental settings can further enhance the comprehensiveness of detectability of base reactivity within a specific time range [15].
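The consistency check between probed reactivities and predicted accessibilities mentioned earlier in this passage can be illustrated with a rank correlation. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman correlation is an assumption of this example, not necessarily the statistic used in the cited studies, and the per-base values are made-up toy numbers rather than data from any experiment.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-base values, for illustration only.
    reactivity = [0.05, 0.80, 0.60, 0.10, 0.90]   # probing signal
    p_unpaired = [0.10, 0.70, 0.75, 0.20, 0.95]   # predicted accessibility
    print(f"Spearman rho = {spearman(reactivity, p_unpaired):.2f}")
```

A rank-based statistic is used here because reactivity scales differ between reagents and protocols, so only the ordering of bases is compared.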
For the detection of the co-accessibility of multiple bases, a structure probing method variant called mutational profiling has been developed (e.g., SHAPE-MaP [16] and DMS-MaP [17]). In such methods, the mutations on the same read are used as a clue to determine which structure or combination of structures is observed in vivo [18, 19]. The secondary structure is known to spontaneously fold in a manner deeply dependent on the primary sequence but to be perturbed by triggers that can include other molecules’ binding, modification, and different cell states (e.g., temperature, pH, or metal ion; [11, 19, 20]). Analyzing the inconsistency between the predicted structures and experimentally determined reactivities is, therefore, an efficient approach to discover the existence of in vivo-specific regulation that causes structure disruption. Recently, a probing method that captures the folding of growing nascent RNA during transcription has been developed [21, 22]. Because RNA secondary structure is formed simultaneously during transcription [23], the structure comparison between nascent and mature RNAs potentially reveals the formation of a structure differing from the initial or stable form, suggesting that thermodynamic fluctuations may affect RNA folding kinetics.

Compared with RNA secondary structure prediction, RNA 3D structure prediction is recognized as a more challenging problem due to the more numerous degrees of freedom for 3D structures, such as distances and angles of each base pair, resulting in high computational demand. Despite such difficulties, improvements in RNA structure prediction have been attempted for 3D structures using information from structures experimentally validated via X-ray crystallography and cryo-EM. The accumulation of RNA 3D structure information is also expected to help clarify the kinetic mechanisms of RNA and protein binding [24].
For both RNA secondary structure and 3D structure analyses, machine learning has focused on combining classical structure prediction methods with the vast amounts of experimental structure data recently obtained. One successful example has utilized determined RNA structures in order to improve prediction accuracy of existing RNA secondary structure prediction methods. While some studies have applied such data to improving the parameters in thermodynamic models [25–27], another strategy cooperatively improves prediction models by aligning the predicted base pairs and detected accessibilities to be consistent [28]. The application of deep learning models has enabled the implementation of a new strategy to tackle complex problems, such as high-resource demands or high-dimensionality input features. For example, a deep neural network model achieved a dramatic advancement of binding motif identification required for RNA interaction determination. By considering the higher-order influences of base combinations for RNA and protein binding, this model enables accurate binding site prediction as well as a deeper elucidation of RNA binding regulation mechanisms [29]. As such, machine learning methods, accumulated data, and their resulting accurate structure predictions have a strong synergistic effect in moving the field of RNA biology forward by expanding the targets of RNA structure analysis, from short non-coding RNAs (ncRNAs) to long or coding RNAs, with a variety of goals.
Organization of This Book

In this series, we introduce recent progress in RNA structure prediction and its application from a broad viewpoint, particularly over the last several years. Here, we introduce the topics covered by 16 chapters and discuss some additional topics that are closely related to RNA structure prediction, such as RNA inverse folding. Because of advancements in experimental protocols and devices (e.g., nanopore sequencing [30]), the integration of new types of data as well as new analysis techniques is necessary. Hence, the variety of topics contained in this series is hoped to serve as a simple guide for both experimental and computational RNA researchers.

RNA Secondary Structure Prediction

RNA secondary structure prediction is a problem that involves predicting the combinations of base pairing between complementary bases, namely A with U and G with C, as well as G–U wobble pairs. The gold-standard method for RNA secondary structure prediction is a thermodynamic model in which the free energy change of the structure is approximated by the experimentally obtained parameters for each partial structure. A highly stable structure is then selected as the best prediction result, taking into account the landscape of all possible structures through the use of one of several metrics, such as minimum free energy (MFE), maximum expected accuracy, and centroid structure computation [31].
Because of the enormous number of possible base pairs, reliable secondary structure prediction is obtained in exchange for the considerable computation time required to find stable structures whose free energy changes are substantially low. For example, one can assume that each base of a sequence of 100 nucleotides is randomly sampled from four types of bases. If a base can bind to its complementary base at any position, each pair of bases can form a base pair with a probability of 6/16 = 0.375 (six of the 16 ordered base combinations are A–U, U–A, G–C, C–G, G–U, or U–G). Given this probability, the order of all possible structures for a sequence of 100 bases can reach roughly 10^55. In a thermodynamic model, however, the probability of the unstable structures or base pairs is low enough that they rarely affect the results of the structure prediction. One strategy that ignores those low-probability structures can reduce the computation area to be explored while maintaining the precision of structure prediction.

An efficient enumeration of all possible structures can be conducted by DP. The computational time of a DP algorithm depends on the sequence length but can be varied depending on the structure types to be included. For example, pseudoknot structures consist of base pairs crossing over each other. The computational time of DP-based structure prediction with pseudoknots is O(N^6) while the computational time for structures without pseudoknots is O(N^3), where N is the sequence length [32]. This makes it difficult to analyze RNAs longer than several tens of nucleotides, although pseudoknots can function as a key motif for biological regulation [33]. Kimchi et al. (2019) tackled this problem by developing LandscapeFold, which enumerates structures including pseudoknots based on a polymer physics model [34]. It can predict the MFE structure as well as the distribution of free energy changes for all possible structures rather than utilizing a DP algorithm based on thermodynamic models. The structure analysis of LandscapeFold can be a powerful approach to analyze the structures of functional short RNAs, such as ligands or aptamers.
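The counting argument earlier in this section can be made concrete with a short Python sketch. It first recovers the 6/16 pairing probability by enumerating ordered base combinations, then counts the pseudoknot-free structures of a given sequence with an O(N^3) DP. The minimum hairpin loop size `min_loop` and the `can_pair` hook are assumptions of this sketch, and the code is not meant to reproduce the 10^55 estimate quoted above, whose exact derivation is not given in the text.

```python
from itertools import product

BASES = "ACGU"
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pairing_probability():
    """Fraction of ordered base combinations that can pair (6/16)."""
    hits = sum((a + b) in PAIRS for a, b in product(BASES, repeat=2))
    return hits / len(BASES) ** 2


def count_structures(seq, min_loop=3, can_pair=lambda a, b: (a + b) in PAIRS):
    """Count pseudoknot-free secondary structures of seq, including the
    fully unpaired one. Three nested loops give O(N^3) time."""
    n = len(seq)
    # c[i][j] counts the structures of seq[i:j]; empty spans count as 1.
    c = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            total = c[i][last]  # the last base stays unpaired
            for k in range(i, last - min_loop):  # the last base pairs with k
                if can_pair(seq[k], seq[last]):
                    total += c[i][k] * c[k + 1][last]
            c[i][j] = total
    return c[0][n]


if __name__ == "__main__":
    print(pairing_probability())
    print(count_structures("GGGAAAUCCCGGGAAAUCCC"))
```

Python integers have arbitrary precision, so the same function can be applied to sequences of 100 nucleotides or more without overflow. When every base may pair with every other and `min_loop=0`, the counts reduce to the Motzkin numbers, which is a convenient sanity check.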
In addition to the prediction of structures including pseudoknots, another problem is that an appropriate metric is required to compare multiple RNA structures. The novel tool planeGraph2tree clusters RNA structures with pseudoknots based on the PEELING algorithm [35]. Its input is a plane graph obtained from each RNA secondary structure. planeGraph2tree identifies a topological centroid of the graph and constructs a topological centroid tree. The distance between two structures is then obtained by the minimum cost of editing operations.

For the structure analysis of long RNAs, even the prediction of pseudoknot-free structures is infeasible because of the computational time and precision problem. Computation of local structures instead of global structures is a solution to accelerate DP computation. Rfold is a model in which a maximum constraint is set for the base pair distance in order to analyze the local structure stability of RNA [36]. The model can be applied for the computation of a variety of structure metrics: stem probability, accessibility, and each loop-type probability. However, the problems of over- and underflow as well as computational time exist for long RNAs including mRNAs and pre-mRNAs. ParasoR is an algorithm based on the Rfold model, and its DP algorithm is modified and distributed to multiple computational nodes for a local structure analysis [9]. Similar to global structure prediction, it can compute a variety of structure scores for all possible structures under the constraint of a maximum span for base pairs. This platform is also applied to an efficient simulation of RNA secondary structures, for example, a dynamic conformational change caused by a single point mutation.
Radiam has been developed to detect mutations that can disrupt a large part of a secondary structure, called riboSNitches [37], even within long RNAs. To further improve the computation of global structure prediction based on a DP algorithm, LinearFold accelerates MFE structure prediction using beam search, which ignores only the low-probability results to avoid a substantial decrease in prediction accuracy [38]. The acceleration depends on the beam size that defines the stack size of the partial structures to be considered for the next DP computation. The computational complexity of LinearFold MFE structure prediction is O(N b log b), where N is the sequence length and b is the user-defined beam size. This strategy can also be applied to RNA–RNA interaction prediction as implemented in LinearCofold, which is introduced in this series.

As such, a wide variety of RNA secondary structure prediction tools have been developed in terms of their possible structures, metrics to evaluate the structures, or approaches to extract representative structures. Several platforms have been developed for structure analysis using multiple tools simultaneously, including ViennaRNA [39], Freiburg RNA Tools [40], and Rtools [5]. Rtools is a web server that can analyze a query sequence with eight different applications for RNA secondary structure analysis. CentroidFold and CentroidHomfold predict the representative structure for the query according to the γ-centroid estimators. RintD and RintW are tools to visualize the distribution of the secondary structures over the Hamming distance from the reference structures, indicating the stability of the reference structures despite thermodynamic fluctuation. Some other workbenches for RNA-seq analysis also provide tools for RNA secondary structure analysis, and these can be used for the quick analysis of target transcripts as well as ncRNAs [41].

While existing structure prediction models can achieve accurate prediction in general, the accuracy of structure prediction may occasionally be decreased for certain RNAs, such as long RNAs or RNAs in vivo.
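To illustrate the idea of a distribution of structures over the Hamming distance from a reference, as mentioned above for RintD and RintW, the following sketch parses dot-bracket strings and tallies distances. It assumes that the Hamming distance between two structures is the number of base pairs present in exactly one of them; the tools themselves may use a different definition, and they compute the distribution over the whole Boltzmann ensemble rather than over a hand-written list like the hypothetical `sample` below.

```python
from collections import Counter


def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def hamming_distance(struct_a, struct_b):
    """Number of base pairs found in exactly one of the two structures."""
    if len(struct_a) != len(struct_b):
        raise ValueError("structures must have the same length")
    return len(base_pairs(struct_a) ^ base_pairs(struct_b))


def distance_histogram(reference, structures):
    """Tally how many structures lie at each distance from the reference."""
    return Counter(hamming_distance(reference, s) for s in structures)


if __name__ == "__main__":
    reference = "((...))"
    sample = ["((...))", "(.....)", ".......", "((...))"]  # hypothetical sample
    for dist, count in sorted(distance_histogram(reference, sample).items()):
        print(dist, count)
```

A histogram concentrated near distance zero would suggest that the reference structure is robust to fluctuation, whereas mass at large distances would point to competing alternative folds.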
Previous studies have attempted to improve models and parameters through machine-learning approaches. The classical thermodynamic models contain several thousands of parameters, for example, with 7850 parameters used in the full Turner model [42]. Optimizing those models with a large number of parameters poses a risk of overfitting. MXfold is a method to train model parameters for the free energy change approximation based on a structured support vector machine [43]. Combining with L1 regularization, MXfold has been shown to produce the best prediction accuracy while avoiding overfitting.

Other than the structure of single RNAs, the prediction of RNA–RNA interactions has been actively studied because those interactions are tightly related to the function of ncRNAs, such as miRNAs, siRNAs, and long ncRNAs. While there is substantial space to explore for potential binding regions genome-wide, RIblast and RIsearch2 apply a seed-and-extend-based alignment strategy to speed up and improve the discovery of highly complementary regions that can form a stable structure with the queried RNA [44]. By changing the setting for the seed search step to use a fast computation library, it enables the efficient discovery of the candidate regions that can form stable structures with base pairs. As an introduction of various RNA–RNA interaction predictions, Fukunaga et al. have provided a comprehensive survey on the prediction web services including their LncRRIsearch [45].

Application of RNA Secondary Structure Prediction

Thanks to highly accurate structure prediction, the comparison of predicted structures can be further applied to the functional analysis of RNAs.
In particular, the comparison of predicted structure stability with experimentally determined accessibility has the potential to reveal the existence of the external factors that cause structure alteration, including RNA binding proteins or base modification [46]. While RNA secondary structure can be experimentally determined by several approaches (e.g., X-ray crystal structure analysis, cryo-electron microscopy, and NMR), they require the appropriate concentrations of a single RNA and suffer from problems of feasibility and throughput limitations. A high-throughput structure probing method is an approach that can overcome the throughput and coverage problems of existing conformational analysis methods. By using a high-throughput sequencing technique, this method detects RNA modification at reactive sites. Takizawa has introduced a means of inferring the reactivity of each base from high-throughput structure probing data using the computational methods PROBer [47], BUMHMM [48], and reactIDR [13]. In Chap. 13, the author applied these pipelines to discover the structural constraints on the RNA genome of influenza virus [14].

Similarly, the combination of RNA binding protein (RBP) pulldown and high-throughput RNA sequence analysis has been widely applied at a transcriptome-wide scale to reveal the mechanism of RBP and (m)RNA interaction. However, this strategy is susceptible to biased backgrounds and false positives derived from the pulldown assay step, even when UV cross linking is included, as is used in CLIP-seq and its alternatives [49]. For the accurate inference of sequential and structural RBP binding motifs, an artificial library of random short RNAs is utilized in SELEX [50] and RNAcompete [51], enabling an efficient motif analysis with high coverage.
These methods solve the practical problem of capturing the desired aptamers with a binding affinity that is high enough for the target RBP among a pool of random RNAs. ResidualBind, introduced in this series, is a deep learning framework for inferring RBP binding motifs from experimental RBP binding data [52]. The outstanding characteristic of ResidualBind is that it can perform a global importance analysis for the existence of motifs to improve the interpretability of the deep learning model.

The functional domains of RNAs, particularly of ncRNAs, have been discovered by examining the conservation of not only sequences but also secondary structures required to interact with other molecules to perform biological roles. One example of the potential conservation signature is a pattern found in RNA structure-aware alignment [53]. Specifically, it is defined as a co-occurrence of two mutations at paired bases that does not disrupt their pairing [54]. Walter Costa et al. (2019) developed a novel approach, called SSS-test, to evaluate the significance of positive and negative selection on RNA structures based on
