
by Abraham Heifets A thesis submitted in conformity with - TSpace PDF

155 Pages·2013·7.62 MB·English


AUTOMATED SYNTHETIC FEASIBILITY ASSESSMENT: A DATA-DRIVEN DERIVATION OF COMPUTATIONAL TOOLS FOR MEDICINAL CHEMISTRY

by

Abraham Heifets

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

© Copyright 2014 by Abraham Heifets

Abstract

Automated Synthetic Feasibility Assessment: A Data-driven Derivation of Computational Tools for Medicinal Chemistry
Abraham Heifets
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2014

The planning of organic syntheses, a critical problem in chemistry, can be directly modeled as resource-constrained branching plans in a discrete, fully-observable state space. Despite this clear relationship, the full artillery of artificial intelligence has not been brought to bear on this problem due to its inherent complexity and multidisciplinary challenges. In this thesis, I describe a mapping between organic synthesis and heuristic search and build a planner that can solve such problems automatically at the undergraduate level. Along the way, I show the need for powerful heuristic search algorithms and build large databases of synthetic information, which I use to derive a qualitatively new kind of heuristic guidance.

Contents

Relation to published work
1 Introduction
2 Prolegomena to any future automated synthesis planning
  2.1 Search
  2.2 AND/OR graph search algorithms
    2.2.1 AO*
    2.2.2 Learning in Depth-First Search (LDFS)
    2.2.3 The question of cycle semantics
    2.2.4 Proof Number Search (PNS)
    2.2.5 PN*
    2.2.6 Proof Disproof Search (PDS)
    2.2.7 Depth-First Proof Number and variants (DFPN, DFPN+, DFPN(r), DFPN-TCA, and DFPN-SNDA)
  2.3 The shapes of chemistry
  2.4 Challenges of organic synthesis
  2.5 Automated organic synthesis planners
    2.5.1 LHASA and its variants (interactive synthesis planners)
    2.5.2 Noninteractive synthesis planners
  2.6 Heuristics
    2.6.1 Complexity-based
    2.6.2 Fragment-based
    2.6.3 Machine learning and retrosynthetic analysis
  2.7 Reaction libraries
  2.8 The critical need for search guidance
3 A declarative description of chemistry
  3.1 Definitions
4 Retrosynthetic search algorithms
  4.1 Introduction
  4.2 A Proof Number Search-based Solver
  4.3 Constructing a Public Chemistry Benchmark
  4.4 Results and Discussion
  4.5 Chapter Conclusion and Future Work
5 Compilation of synthesis and chemical data
  5.1 SCRIPDB
  5.2 Discussion
    5.2.1 Patents as a source of chemical images
    5.2.2 Patents as biomedical literature
    5.2.3 Patents as a reaction database
    5.2.4 Patents as a bioisostere catalog
  5.3 Chapter Conclusion and Future Work
6 Domain-specific heuristics for synthetic feasibility
  6.1 Introduction
    6.1.1 Humans, eh?
    6.1.2 Objective measures of synthetic feasibility
  6.2 Materials and Methods
    6.2.1 Data collection
    6.2.2 Data cleaning
    6.2.3 Data labeling
    6.2.4 Data modeling
  6.3 Results
  6.4 Discussion
  6.5 Chapter Conclusion and Future Work
7 Summary & Conclusion
Appendices
A Glossary
B Which molecules should be built?
  B.1 Introduction
  B.2 System and methods
    B.2.1 Correspondence of bound ligands
    B.2.2 Ligand alignment
    B.2.3 Residue cluster extraction via clique detection
  B.3 Results and discussion
    B.3.1 Heme
    B.3.2 Nicotinamide adenine dinucleotide
  B.4 Chapter Conclusion and Future Work
Bibliography

List of Figures

2.1 Cyclical AND/OR graphs. Semicircles connect paths in a single hyperedge. Double circles indicate goal states. The examples are deliberately simple; equivalent yet more-realistic examples may be generated by replacing simple arcs with larger subgraphs.
2.2 If a descendant node can be reached via multiple paths, then Equations 2.1 and 2.2 are no longer correct. An example schematic when a node has repeated precursors showing the (dis)proof counts will be incorrect. In this case, there are only 3 leaf nodes but the root reports a count of 4.
2.3 Graph history interaction problem from Kishimoto and Müller (2004). Assume node D is a loss for the player at the root. Node B is an AND node, marked by a semicircle connecting its out arcs. Nodes C, G, and H are AND nodes as well but are not marked because they have a single outgoing arc.
2.4 Aspirin. Vertices labeled C, H, or O denote carbon, hydrogen, and oxygen atoms, respectively. Edges drawn with single lines denote single bonds and double lines represent double bonds. Typically, bonds to hydrogen are not drawn.
2.5 Esterification reaction of an alcohol active site with an anhydride active site to produce an ester. Atoms in the molecular fragments are numbered to help the reader track bond changes. When atomic labels are omitted, the vertex is presumed to be a carbon atom with sufficient hydrogens to total 4 bonds. Reaction conditions have been omitted for simplicity.
2.6 Synthesis of aspirin (right column) from carbon dioxide, sodium hydroxide, phenol, acetic acid and ketene starting materials (left column) via the precursor molecules, salicylic acid (top mid) and acetic anhydride (bottom mid). The final aspirin-forming step is an application of the esterification reaction depicted in Fig 2.5.
2.7 Palytoxin, a 409-atom molecule synthesized in 1994 (Suh and Kishi, 1994). Bonds depicted as wedges indicate the bond is angled out of the plane of the paper toward the reader. Bonds depicted as dashes indicate the bond is angled away from the reader.
2.8 Example from Todd (2005). Structure (40) depicts the quinone Diels-Alder reaction, while (41)-(43) show natural products that had been synthesized using the quinone Diels-Alder. LHASA does not apply the quinone DA to these cases.
2.9 8 step synthesis from Takahashi et al. (1990). Molecular complexity proceeds nonmonotonically.
2.10 Figure 19 from Boda et al. (2007) depicting minimum, maximum, and average synthetic accessibility scores by five medicinal chemists. Structures are sorted by average score. Molecules of particularly wide score disagreement are labeled. Most molecules have scores of 4 ± 1 and many have ranges which overlap.
2.11 Figure 7 from Huang et al. (2011) depicting minimum, maximum, and average synthetic accessibility scores by five medicinal chemists. Structures are sorted by average score. Most molecules have scores of 4 ± 1 and many have ranges which overlap.
2.12 Example comparison of synthetic complexity measures, taken from Barone and Chanon (2001). Synthetic progress is nonmonotonic.
2.13 Reaction example from Pirok et al. (2006) depicting a Friedel-Crafts Acylation reaction. Additional properties specify the charge necessary at the active site for the reaction to complete. Problematic compounds are excluded with additional patterns.
2.14 Diels-Alder example from Wilcox and Levinson (1986). Lines 1 and 4 show two reactions. Lines 2 and 3 are the MXC and CXC, respectively, for the first reaction. Line 5 shows the maximum common substructure from the two reactions (note the activating electron-withdrawing oxygen on the dienophile).
2.15 Example from Law et al. (2009). Figure (a) depicts a sample reaction, while (b) and (c) show an extracted core and extended core, respectively. Compare to the MXC and CXC from Figure 2.14. Law et al. note that the non-chemically-essential atom 2 is correctly not included in the extended core; in contrast, a bond radius approach such as (Satoh and Funatsu, 1999) would have included it.
4.1 The benchmark target molecules. Images generated directly from the problem definition using OpenBabel (O'Boyle et al., 2011a).
4.2 Computer-generated Atorvastatin synthesis matching the synthesis reported in Brower et al. (1992) and Roth (2002).
5.1 Cumulative SCRIPDB content. Although SCRIPDB includes patents from 2011, we show data through 2010, the last complete year. (a) Left panel shows the number of ChemDraw CDX structure files, and the structures described therein, available in SCRIPDB for various years. SCRIPDB contains 4,814,913 CDX files from 2001 through 2010, comprising 10,840,646 molecules. Duplicate molecules were filtered from each patent but not across patents, as described in the text. (b) Right panel shows the number of patents and details the subset containing reactions. For 2001 through 2010, SCRIPDB contains 107,560 patents, of which 25,048 contain synthetic reactions.
5.2 Example Markush structure from US Patent 6,268,504 (Raveendranath et al., 1999) defining a chemical class via substituent, positional, and frequency variations.
5.3 Structures per patent in SCRIPDB. While 34,789 patents contain ten or fewer structures, two-thirds of patents contain more than ten and 1,296 patents contain more than a thousand structures.
5.4 Number of CDX data files that contain various numbers of reaction arrows.
5.5 Sample of search results for molecules containing an acridine substructure.
6.1 Chart from Boda et al. (2007) depicting minimum, maximum, and average synthetic accessibility scores by five medicinal chemists. Structures are sorted by average score. Molecules of particularly wide score disagreement are labeled. (This is Figure 2.10, duplicated here for reader convenience.)
6.2 Chart from Ertl and Schuffenhauer (2009) depicting the average of chemist scores for 40 test molecules, in blue. The error bars denote the standard error of the mean of scores by 9 chemists. Therefore, to translate to standard deviation the bars would need to be tripled in size. Structures are sorted by average score. Molecules of particularly wide score disagreement between the average chemist score, in blue, and their software's scores, in red, are labeled.
6.3 Data flow schematic depicting the processing steps in each of the collection, cleaning, labeling, and modeling stages.
6.4 Example synthesis from US Patent 7,238,717 (Fleming et al., 2007). "Procedure A (Synthesis of Weinreb Amides)" is described in the patent as: "A substituted anthranilic acid (24 mmol) was dissolved in acetonitrile (200 mL), 1.05 equivalents of N,O dimethylhydroxylamine hydrochloride, 1.05 equivalents of EDC, 0.05 equivalents of dimethylaminopyridine and 1.0 equivalent of triethylamine were added and the reaction was stirred at room temperature overnight. The acetonitrile was removed by rotary evaporation and the residue was partitioned between ethyl acetate and water. The organic layer was washed with brine then concentrated to a residue. The residue was chromatographed on silica gel (ethyl acetate as eluent) to give the product. Typical yields are 70-90%".
6.5 Available bounding boxes around geometric entities in the ChemDraw files distributed with USPTO patents.
6.6 Difficult visual parsing example. Which molecules are reagents and which are products, for which reaction step?
6.7 CDX reading example. First bounding boxes (in blue) are computed for each molecule. Then the minimum distance between fragments is measured (in green), and used to compute mean and standard deviation for inter-molecule distance. Molecules intersecting with 2 standard deviations from an arrow head or tail (dashed purple arrow) are considered possible reagents or products, and are classified as described in the text. The three bottom molecules overlap and, therefore, are grouped into an enclosing bounding box (in orange). Because this box encloses an arrow, these molecules are rejected. Similarly, because the central arrow intersects line segments (in the center black circles) we cannot assign unique direction for the arrow and it is rejected.
6.8 Illustrative schematic of organic synthesis. Although there are 8 reaction arrows depicted, the A→B reaction is repeated four times and it is likely that only 5 reactions were run at the bench. Assuming that A and C are starting materials, Algorithm 5 labels A and C as requiring zero steps, B requiring 1, D requiring 2, E requiring 3, F requiring 4, and G requiring 5.
6.9 Distribution of the labels for molecules in the training data.
6.10 The predictive performance on the automatically-labelled training data for the top 3 algorithms from Table 6.1 and random forest for comparison. The plots were generated with R; the linear fit, R² and p-values were computed using R's lm linear model.
6.11 The predictive performance on the automatically-labelled external validation data for the top 3 algorithms from Table 6.1 and random forest for comparison. The plots were generated with R; the linear fit, R² and p-values were computed using R's lm linear model.
6.12 The predictive performance on the manually scored data from Ertl and Schuffenhauer (2009) for the top 3 algorithms from Table 6.1 and random forest for comparison. The plots were generated with R; the linear fit, R² and p-values were computed using R's lm linear model. For comparison, Ertl and Schuffenhauer (2009) report that the r² among 9 chemists on these 40 molecules ranged from 0.450 to 0.892, with an average of 0.718.
B.1 LigAlign Flowchart. The three stages of ligand-based active site alignment, namely ligand correspondence, ligand alignment, and cluster detection, are shown along with the techniques used to accomplish each stage.
B.2 Topological vectors encode atom counts as a function of bond distance from a selected atom. For example, the numbers next to the atoms depict the minimal bond distance from atom A. The topology vector for B is [3,5,1] indicating three atoms at distance 1, five atoms at distance 2, and one atom at distance 3.
B.3 A sequence alignment of P450 homologs and extracted patterns, as computed by MUSCLE (Edgar, 2004) using default settings. The colors correspond to the default Clustal coloring scheme. Four biologically-significant regions (Nebel, 2006) are boxed and numbered and the conserved cysteine residue from PROSITE pattern PS00086 is marked. The first five rows comprise the input proteins; the next five rows show the subset of residues marked as conserved by LigAlign; the last two rows are the conserved residues in the results reported by Nebel.
B.4 A ligand-based alignment of P450 homologs showing the shared heme moiety (colored tan) surrounded by the residue clusters detected by LigAlign. The patterns labelled in Figure B.3 are distinguished by color. Pattern 1 is shown in purple; pattern 2 is dark blue; pattern 3 is green; and pattern 4 is presented in red. Clustered residues which are not members of one of these patterns are shown in light grey.
B.5 A rigid minimal-RMSD ligand-based alignment of the NAD ligand from the 21 NAD-binding proteins listed in Table B.1. The shared NAD ligand is colored tan. Consistent clusters of residues are shown surrounding the ligand. Hydrophobic residues (LVAGIMP) are depicted as orange lines; acidic residues (DE) are shown as pink lines. Red spheres correspond to the location of structurally conserved water molecules.
B.6 Fragmentation of the bound NAD ligand from 1HDR (each rigid fragment is shown in a different color), after fragmentation but before alignment to the green pivot NAD ligand in 1HDX. The NAD moieties of nicotine, nicotine ribose, nicotine phosphate, adenine phosphate, adenine ribose, and adenine are labeled.
B.7 A protein alignment using only the nicotine moiety of the NAD ligand (shown in a dashed circle). The shared NAD ligand is colored tan. The misalignment induced by nicotine alignment on the non-nicotine moieties is apparent. Consistent clusters of residues, as listed in Table B.2, are shown surrounding the ligand. The underlined C5 cluster was identified by flexible alignment but was not detected by rigid alignment. Hydrophobic residues (LVAGIMP) are depicted as orange lines; basic residues (RHK) are blue lines. Red spheres correspond to the location of structurally conserved water molecules.

Description:
... the need for powerful heuristic search algorithms and build large databases of synthetic ... Uniform-cost search expands the least-cost leaf node first.
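
The last sentence of the description can be made concrete with a small sketch. This is not code from the thesis: the function name uniform_cost_search and the dictionary-of-dictionaries graph encoding are assumptions chosen for illustration. It shows only the generic textbook technique, in which a priority queue always yields the cheapest frontier (leaf) node next.

```python
import heapq


def uniform_cost_search(graph, start, goal):
    """Return (cost, path) for a cheapest start-to-goal path, or None.

    `graph` is a hypothetical encoding: {node: {neighbor: edge_cost}}.
    Edge costs are assumed to be non-negative.
    """
    # Frontier entries are (path cost so far, node, path taken).
    frontier = [(0, start, [start])]
    # Cheapest cost found so far for each discovered node.
    best = {start: 0}
    while frontier:
        # Expand the least-cost leaf node first.
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # Stale entry: a cheaper route to this node exists.
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    demo = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
    print(uniform_cost_search(demo, "A", "D"))
```

With non-negative edge costs, the first time the goal is popped from the queue its cost is minimal, which is why the function can return immediately at that point.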
