arXiv:1702.06001v1 [physics.data-an] 1 Feb 2017

Optimal design of experiments by combining coarse and fine measurements

Alpha A. Lee* and Michael P. Brenner
School of Engineering and Applied Sciences and Kavli Institute of Bionano Science and Technology, Harvard University, Cambridge, MA 02138, USA

Lucy J. Colwell†
Department of Chemistry, University of Cambridge, CB2 1EW, Cambridge, UK

In many contexts it is extremely costly to perform enough high quality experimental measurements to accurately parameterize a predictive quantitative model. However, it is often much easier to carry out experiments that indicate whether a particular sample is above or below a given threshold. Can many such binary or "coarse" measurements be combined with a much smaller number of higher resolution or "fine" measurements to yield accurate models? Here, we demonstrate an intuitive strategy, inspired by statistical physics, wherein the coarse measurements identify the salient features of the data, while fine measurements determine the relative importance of these features. We illustrate our strategy by considering the problem of solubility prediction for small organic molecules from their 2D molecular structure.

A large class of scientific questions asks whether the values of dependent variables can be accurately predicted given sets of input variables and training data. Classical statistics shows that this is possible with sufficiently many high resolution measurements, though the cost of performing the required number of experimental measurements can be prohibitive. Nonetheless, in many settings, it can be straightforward to evaluate whether a sample measurement is above or below a certain threshold.

Examples of this problem abound in disparate fields. For example, predicting the solubility of organic molecules is a fundamental question in physical chemistry [1]. Although accurate measurement of the aqueous solubility of molecules is extremely challenging [2], measuring whether a molecule is soluble at a particular concentration is comparatively simple. Similarly, in the drug discovery workflow, biochemical assays that determine whether a molecule binds to a given receptor are relatively simple compared to accurate quantitative measurements of protein-ligand binding affinity [3]. In protein biophysics, a key challenge is to predict the effect of amino acid changes on protein phenotype. Here, threshold measurements are naturally provided by the set of homologous protein sequences that have similar phenotypic properties [4-8]. In contrast, experimentally measuring the relevant biophysical properties is much more difficult. A related problem, crucial to HIV vaccine development, is to predict the fitness landscape of the virus given HIV sequences obtained from patients; again, collecting patient samples is much easier than measuring the fitness directly [9, 10].

Despite the ubiquity of this problem, to our knowledge there is no principled method for combining numerous binary ("coarse") measurements with fewer quantitative ("fine") measurements to produce a predictive model. Although regression approaches can account for a prior estimate of sample error [11], this is not the same as combining two qualitatively distinct forms of data to build a more accurate model.

In this Letter, we introduce an intuitive method that combines coarse and fine measurements, inspired by the methods of statistical physics. We then apply this approach to the problem of predicting the solubility of chemical compounds. We interpret our method as deriving from data an accurate parameterization of an Ising model that relates the independent variables to the output variable.

To fix ideas, we assume each sample is characterized by a vector of p properties f_i ∈ R^p. We are given binary data for numerous samples, of which N_+ (N_-) are above (below) some given threshold. In addition, we are given y ∈ R^M, quantitative measurements for M samples. These measurements could be biological fitness, binding affinity, solubility, etc. We can define data matrices R_± ∈ R^(N_± × p) for samples above/below the threshold, with columns normalized to have zero mean and unit variance. Intuitively, if there are combinations of the p properties that are always present in samples above or below the threshold, then these properties should be good predictors of the measurement. Such persistent correlations can be identified from the eigendecomposition of each sample covariance matrix C_±,

    C_± = (1/N_±) R_±^T R_± = Σ_{i=1}^{N_±} λ_i^± u_i^± ⊗ u_i^±,    (1)

where {λ_i^±}, {u_i^±} are the eigenvalues and eigenvectors. Each u_i^± identifies a particular combination of the p properties that explains a fraction λ_i^± / Σ_i λ_i^± of the data variance [11]. The matrix C_± is an unbiased estimator of the corresponding true covariance matrix, which describes the relationships between variables in the distribution from which samples are drawn. This raises the question of how the quality of this estimator depends on data sampling, and specifically whether the sample eigenvalues and eigenvectors are equally trustworthy. For example, a feature that happens to appear in many samples acquires a large corresponding eigenvalue, even though such features are uncorrelated with the output variable. For protein sequences, a natural source of such spurious correlation is phylogeny induced by evolutionary dynamics [12, 13].

Here we propose that, over all the eigenvectors u_i^± computed from C_±, the covariance matrices of the binary data measurements in fact reliably identify data features, but the significance of these features as estimated by the corresponding eigenvalues λ_i^± is severely corrupted by imperfect sampling. Later in this Letter we justify this ansatz with ideas from statistical physics, showing that this characterization is accurate for a large class of problems. Our ansatz implies that a natural strategy is to use the quantitative measurements to determine the significance of each feature. In particular, we posit a general bilinear model

    y_i = h^T f_i + f_i^T J f_i + ε_i,    (2)

where h is the variable-specific effect, J captures the coupling between variables, and ε_i ~ N(0, σ) models random error. There are p parameters in h and p(p-1)/2 parameters in J. In principle those parameters could be estimated using linear least squares regression if one had M ≫ p(p+1)/2 quantitative measurements. However, it is costly to perform a large number of detailed measurements, so we turn instead to the features identified from C_±. We pose the ansatz

    J = Σ_{i=1}^{p̂_+} c_i^+ u_i^+ ⊗ u_i^+ + Σ_{i=1}^{p̂_-} c_i^- u_i^- ⊗ u_i^-,    (3)

where p̂_± ≤ p, since eigenvectors with low eigenvalues typically reflect noise due to finite sampling [14-16]. Our ansatz reflects the hypothesis that the eigendecomposition of C_± captures variable-variable correlations. If the number of significant eigenvectors is much smaller than the number of variables, random matrix theory provides a rigorous way to determine p̂_± [14-19]; this case will be discussed in detail later. Relaxing this assumption, it is necessary to include all eigenvectors and determine the parameters h, c^+ and c^- by regressing against the few quantitative measurements available. We note that the ansatz (3) reduces the number of variables to p + p̂_+ + p̂_-.

To illustrate our strategy, we turn to the concrete example of solubility prediction. We work with solubility data for 1144 organic molecules reported in [20], the largest freely accessible dataset. It is well known that the aqueous solubility of a chemical compound relates to the presence of different chemical groups [21]: chemical groups that interact favorably (unfavorably) with water increase (decrease) solubility. To describe the chemical groups in each molecule, we concatenate the Avalon Fingerprint [22], the MACCS Fingerprint [23], and the 1024-bit Morgan6 Fingerprint [24], using the package RDKit [25], to produce a vector of 1703 variables for each molecule. To generate binary data we label molecules with solubility > 10^-2 mol/L as soluble, and those with solubility < 10^-4 mol/L as insoluble. We then diagonalize the resulting matrices C_± after eliminating variables that are zero for more than 1% of the dataset. For simplicity, we take p̂_± = min{p, N_±}, since it is not known a priori whether the number of significant eigenvectors is small compared to the number of variables. We randomly partition the dataset into a validation set (10%) and a training set, and estimate {h, c^+, c^-} using linear regression with an L2 penalty on the coefficients to prevent overfitting (ridge regression), with the weight of the L2 penalty estimated using 10-fold cross validation on the training set.

FIG. 1. Combining coarse and fine measurements accurately predicts solubility. (A) Out-of-sample solubility prediction (predicted vs. measured log(S / mol L^-1)) has a mean absolute error of 0.62. The estimate is arrived at by analyzing 10 random partitions of the data into training and verification sets. (B) The mean absolute error as a function of the number of quantitative measurements given to the model. Error bars are obtained from averaging over 10 realizations.

Figure 1A shows that out-of-sample prediction for the validation set has a mean absolute error (MAE) of 0.62, comparable to state-of-the-art models based on convolutional neural networks [26, 27], though our model is significantly more parsimonious. Importantly, the prediction accuracy of our method remains robust when the number of quantitative measurements is reduced (Figure 1B). The error increases by less than a factor of 2 when the sample size decreases by an order of magnitude, confirming that our approach exploits the numerous binary measurements available to construct an accurate model using comparatively few quantitative measurements.
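Under the ansatz (3), the quadratic term collapses to f^T J f = Σ_i c_i^± (f · u_i^±)^2, so estimating {h, c^+, c^-} reduces to an ordinary ridge regression on p + p̂_+ + p̂_- derived features. The NumPy sketch below illustrates this pipeline on generic data matrices; the function names, the closed-form ridge solve, and the standardization details are our own choices, not specifics taken from the Letter (which cross-validates the penalty weight).

```python
import numpy as np

def eigenfeatures(R):
    """Column-standardize a data matrix and return the eigenvalues and
    eigenvectors of its sample covariance C = R^T R / N, in descending
    order. Assumes no constant columns (the Letter removes near-constant
    variables beforehand)."""
    R = (R - R.mean(0)) / R.std(0)
    C = R.T @ R / len(R)
    lam, U = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    return lam[::-1], U[:, ::-1]

def coarse_fine_fit(R_plus, R_minus, F, y, alpha=1.0):
    """Fit y ≈ h·f + Σ_i c_i^± (f·u_i^±)^2 (Eqns. (2)-(3)): eigenvectors
    come from the coarse (binary-labeled) data R_±, and the weights
    {h, c^+, c^-} come from ridge regression on the fine measurements y."""
    _, U_p = eigenfeatures(R_plus)
    _, U_m = eigenfeatures(R_minus)
    X = np.hstack([F, (F @ U_p) ** 2, (F @ U_m) ** 2])
    X = np.hstack([np.ones((len(F), 1)), X])          # intercept column
    # Closed-form ridge solution: (X^T X + alpha I)^-1 X^T y.
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    def predict(F_new):
        Z = np.hstack([np.ones((len(F_new), 1)), F_new,
                       (F_new @ U_p) ** 2, (F_new @ U_m) ** 2])
        return Z @ w
    return predict
```

With the fingerprint matrix in place of F and measured log-solubilities as y, this reproduces the shape of the Letter's pipeline; selecting alpha by 10-fold cross validation, as the authors do, is omitted for brevity.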
To understand why our heuristic strategy is successful, we consider a model problem where data is generated according to Eqn. (2), which is the maximum entropy model used in fields ranging from neuroscience [28, 29] to sociophysics [30], and is analogous to the Ising model. We thus interpret the dependent variable as an "energy", noting that the logarithm of the solubility is indeed proportional to the solvation energy. The interaction matrix J can be decomposed into a sum of outer products of eigenvectors ζ_i (Hopfield patterns [31]) and eigenvalues E_i (Hopfield energies) as

    J = Σ_{i=1}^{m} E_i ζ_i ⊗ ζ_i.    (4)

Furthermore, to model the binary features used in solubility prediction, we make the assumption that the independent variable is a vector of ±1.

To simulate binary measurements we randomly draw samples from the uniform distribution, evaluate Eqn. (2) to determine the energy of each sample, and retain those samples that fall below a certain energy. Consider an interaction matrix J with p = 100 and m = 3 randomly generated patterns. To fix ideas, for the remainder of this Letter we let all patterns be attractive with (E_1, E_2, E_3) = (-30, -25, -20), h = 0 and ε_i = 0. Using this model, we generate 5000 random vectors, take the N = 500 samples with lowest energy, and compute the eigendecomposition of the resulting correlation matrix.

FIG. 2. Hopfield patterns can be recovered from threshold sampling. (A) Histogram of eigenvalues agrees with the Marčenko-Pastur distribution (red curve) save for three significant eigenvalues. (B) The top eigenvector is the lowest energy Hopfield pattern; the other eigenvectors are shown in the Supplemental Information. Random matrix cleaning allows us to successfully (C) recover the coupling matrix J and (D) predict Hopfield energies.

Figure 2A shows that the eigenvalue distribution of the sample correlation matrix C follows the Marčenko-Pastur distribution expected for a random matrix [32],

    ρ(λ) = √[((1 + √γ)² − λ)(λ − (1 − √γ)²)] / (2πγλ),    (5)

where γ = p/N, except for three distinct eigenvalues. Figure 2B shows that those statistically significant eigenvectors are indeed the Hopfield patterns that we put in. Therefore, the large eigenvectors of the correlation matrix of the binary dataset correspond to eigenvectors of J. Note that the random matrix theory framework applies because m ≪ p, i.e. the signal is low rank compared to the noise.
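The threshold-sampling experiment above can be reproduced in a few lines. The sketch below is our own minimal version: the planted orthonormal patterns, the fixed random seed, and the variance-rescaled Marčenko-Pastur edge are implementation choices not specified in the Letter.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n_draw, n_keep = 100, 3, 5000, 500

# Planted Hopfield patterns: m orthonormal directions with negative
# (attractive) energies, combined as J = sum_i E_i zeta_i zeta_i^T.
Q, _ = np.linalg.qr(rng.standard_normal((p, m)))
zeta, E = Q.T, np.array([-30.0, -25.0, -20.0])
J = (zeta.T * E) @ zeta

# Coarse measurements: draw uniform ±1 vectors, score them with the
# quadratic energy of Eqn. (2), and keep the lowest-energy 10%.
F = rng.choice([-1.0, 1.0], size=(n_draw, p))
energy = np.einsum('ij,jk,ik->i', F, J, F)   # diag(F J F^T)
S = F[np.argsort(energy)[:n_keep]]

# Spectrum of the (centered) sample correlation matrix.
S = S - S.mean(0)
lam, U = np.linalg.eigh(S.T @ S / n_keep)    # ascending eigenvalues

# Marchenko-Pastur upper edge for gamma = p/N is (1 + sqrt(gamma))^2,
# rescaled here by the mean eigenvalue (the bulk variance scale).
# Eigenvalues beyond the edge are treated as signal.
gamma = p / n_keep
edge = lam.mean() * (1 + np.sqrt(gamma)) ** 2
signal = lam > edge                          # expect a handful (~m) of modes

# Rank-q "cleaned" approximation built from the significant modes only.
J_MP = (U[:, signal] * lam[signal]) @ U[:, signal].T
```

In this uniform-sampling setting the top retained eigenvector should align with the deepest planted pattern, mirroring Figure 2B.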
If the signal were high rank, all eigenvectors must be included and their significance determined by regression against fine measurements, as in the solubility example discussed above.

Turning to eigenvalues, in this model, which features uniform sampling, we find that the eigenvalues are proportional to the Hopfield energy E_i. This allows us to "clean" the correlation matrix by taking the Marčenko-Pastur distribution as the null hypothesis, and using the q eigenvectors above the Marčenko-Pastur threshold to construct a rank-q approximation J_MP of the correlation matrix (here q = 3). Figure 2C shows that J_MP accurately reconstructs J, and allows accurate prediction of the energy of particular states (Figure 2D). Analogously, the eigendecomposition of C_+ allows one to recover repulsive patterns with positive Hopfield energies (see Supplemental Information).

Since the Hopfield patterns are energy minima, taken together Figures 2A-D imply that the probability of the system visiting a particular basin under uniform sampling is proportional to the energy of that minimum. Therefore, the hypervolume of each energy basin is proportional to the basin depth. This fact can be derived by noting that Eqn. (2) is a quadratic form, so the Hessian matrix is a constant; therefore, all local minima have the same mean curvature. Given that low lying energy minima are wide, we can extract the position of energy minima in the space of input variables by studying the correlation structure of the binary dataset. We note that the correlation between basin hypervolume and basin depth appears in many complex physical systems beyond the Ising model [33-35].

A lingering question is whether our inference procedure is robust to the energy threshold. To test this, we consider m Hopfield patterns, chosen as eigenvectors of a symmetrized p × p Gaussian random matrix, with the Hopfield energy chosen to be Gaussian distributed with mean 10 and unit variance. We draw 10000 samples randomly and compute the correlation matrix with the lowest energy N samples. Figure 3 shows that our method is robust: the correlation coefficient between J and J_MP is large and constant for a wide range of thresholds and numbers of Hopfield patterns. The question of how many energy minima can be recovered from the binary data, as well as an interpretation of our result based on the entropy of the energy basins, is discussed in the Supplemental Information.

FIG. 3. Hopfield inference with random matrix cleaning is robust to the energy threshold (shown for m = 3, 5 and 10 patterns as a function of the number of retained samples N). Here ρ is the Pearson correlation coefficient between the entries in J and J_MP. The results are computed by averaging over 50 realizations.

We now turn to consider two common scenarios that break the assumptions made so far, stratified sampling and more complex energy landscapes, to find out how they affect the eigenvectors and eigenvalues.

Stratified sampling: Thus far we assumed that the sampling is uniform before thresholding. However, in many settings the sampling is biased, and some particular input variables are not well-sampled. To model this effect, we draw 5000 random samples, but freeze the first 5 variables to +1 for the first 2500 samples and the last 5 variables to -1 for the remaining 2500 samples. We then evaluate the energy, and again take the lowest 10th percentile.

FIG. 4. Few quantitative measurements enable J to be inferred accurately for stratified datasets. (A) There are now four significant eigenvectors but still only three Hopfield patterns in the model. (B)-(D) The top eigenvector is uncorrelated with all Hopfield patterns, and the Hopfield patterns are demoted to the second to fourth significant eigenvectors. (E) Random matrix cleaning does not recover the coupling matrix. (F) The coupling matrix can be recovered by incorporating quantitative data and using Eqn. (3).
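When sampling bias corrupts the eigenvalues but not the eigenvectors, the fix demonstrated in Figure 4F is to re-estimate the weight c_i of each significant eigenvector by regressing the fine measurements on the squared overlaps, per the ansatz (3). A minimal sketch, assuming h = 0 and using the true planted eigenvectors as a stand-in for the coarse-step output; the helper name and the toy numbers are ours, not the Letter's.

```python
import numpy as np

def recover_couplings(U_sig, F_fine, y_fine, alpha=1e-3):
    """Re-estimate the weight c_i of each significant eigenvector u_i from
    quantitative measurements via y ≈ sum_i c_i (f·u_i)^2 (Eqn. (3) with
    h = 0, fit by ridge regression), then rebuild J = sum_i c_i u_i u_i^T."""
    X = (F_fine @ U_sig) ** 2                 # one squared-overlap column per u_i
    c = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y_fine)
    return (U_sig * c) @ U_sig.T

# Synthetic check (our own toy setup): plant a rank-3 coupling matrix,
# and recover its Hopfield energies from 500 fine measurements.
rng = np.random.default_rng(2)
p = 50
Q, _ = np.linalg.qr(rng.standard_normal((p, 3)))
E = np.array([-30.0, -25.0, -20.0])
J_true = (Q * E) @ Q.T
F_fine = rng.choice([-1.0, 1.0], size=(500, p))
y_fine = np.einsum('ij,jk,ik->i', F_fine, J_true, F_fine)
J_est = recover_couplings(Q, F_fine, y_fine)
```

Because the fine energies are exactly quadratic in the overlaps here, the recovered weights match the planted Hopfield energies up to the small ridge shrinkage.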
Figure 4A-D shows that the frozen variables introduce sample-sample correlations, and now there are 4 significant eigenvectors, with the first Hopfield pattern demoted to the second largest eigenvector (Figure 4C). As such, the informative eigenvectors are still present in J_MP, but the eigenvalues are misplaced.

Perhaps unsurprisingly, naïve random matrix cleaning does not recover J (Figure 4E), since there is no a priori reason to discard the first eigenvector unless we know the Hopfield patterns beforehand. We need additional information, for which we turn to the quantitative measurements, to accurately recover the Hopfield energies. Figure 4F shows that an additional 500 quantitative measurements allow us to recover the coupling matrix using ridge regression and the ansatz Eqn. (3).

Complex energy landscapes: The geometric property that the depth of an energy minimum is related to its hypervolume is not universal to all energy landscapes [36, 37]. A natural question is whether the significant eigenvectors and eigenvalues in the correlation matrix of samples below/above an energy threshold allow us to infer features of a complex energy landscape. We consider a landscape that comprises a sum of Gaussians,

    H(f) = Σ_i E_i exp(−E_i² (f · ζ_i)²).    (6)

This landscape has the property that the depth of each energy minimum, E_i (located at ζ_i), is inversely proportional to its width 1/E_i. As above, we let (E_1, E_2, E_3) = (-30, -25, -20) and generate Hopfield patterns by diagonalising a symmetrized Gaussian random matrix. We draw 5000 samples and threshold to find the lowest 500 samples in energy. Figure 5 shows that there are again three significant eigenvectors above the Marčenko-Pastur threshold, but the lowest energy Hopfield pattern is demoted to the third eigenvector, while the highest energy Hopfield pattern is promoted to the top eigenvector. This is expected: the eigenvalue corresponding to each minimum is proportional to the number of samples near that minimum, i.e. the basin volume, which in this case is not proportional to basin depth. However, the eigenvectors still indicate the locations of the energy minima. Therefore a natural strategy is to use f · u_i, the overlaps between the sample vector and each eigenvector, as inputs to a general nonlinear function such as an artificial neural network.

FIG. 5. For energy landscapes where basin depth is not proportional to basin width, the eigenvectors indicate the locations of energy minima but the eigenvalues are awry. (A) There are three significant eigenvectors above the Marčenko-Pastur threshold. (B)-(D) The top eigenvector is correlated with the highest energy minimum, and the last significant eigenvector is correlated with the lowest energy minimum.

In conclusion, we develop a general strategy, grounded in statistical physics, which integrates coarse and fine measurements to yield a predictive model. Since coarse measurements are often significantly less costly to obtain, our strategy provides a new avenue for experiment design. Although our Letter only considered an Ising-type model, the fact that the eigenvectors of the correlation matrix of coarse measurements point toward energy minima suggests a natural way to integrate our result into more complex non-linear models.

The authors thank R. Monasson for insightful discussions. AAL acknowledges the support of the George F. Carrier Fellowship. LJC acknowledges a Next Generation fellowship, and a Marie Curie CIG [Evo-Couplings, Grant 631609]. MPB is an investigator of the Simons Foundation, and acknowledges support from the National Science Foundation through DMS-1411694.

* [email protected]
† [email protected]

[1] A. Llinas, R. C. Glen, and J. M. Goodman, Journal of Chemical Information and Modeling 48, 1289 (2008).
[2] D. S. Palmer and J. B. Mitchell, Molecular Pharmaceutics 11, 2962 (2014).
[3] N. Malo, J. A. Hanley, S. Cerquozzi, J. Pelletier, and R. Nadon, Nature Biotechnology 24, 167 (2006).
[4] F. Morcos, N. P. Schafer, R. R. Cheng, J. N. Onuchic, and P. G. Wolynes, Proceedings of the National Academy of Sciences 111, 12408 (2014).
[5] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon, and M. Weigt, Molecular Biology and Evolution 33, 268 (2015).
[6] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, M. Springer, C. Sander, and D. S. Marks, Nature Biotechnology (2017).
[7] P. Barrat-Charlaix, M. Figliuzzi, and M. Weigt, Scientific Reports 6 (2016).
[8] R. M. Levy, A. Haldane, and W. F. Flynn, Current Opinion in Structural Biology 43, 55 (2017).
[9] K. Shekhar, C. F. Ruberman, A. L. Ferguson, J. P. Barton, M. Kardar, and A. K. Chakraborty, Physical Review E 88, 062705 (2013).
[10] A. L. Ferguson, J. K. Mann, S. Omarjee, T. Ndungu, B. D. Walker, and A. K. Chakraborty, Immunity 38, 606 (2013).
[11] C. M. Bishop, Pattern Recognition and Machine Learning, Vol. 128 (Springer, 2007).
[12] J. Y. Dutheil, Briefings in Bioinformatics 13, 228 (2012).
[13] B. Obermayer and E. Levine, New Journal of Physics 16, 123017 (2014).
[14] L. Laloux, P. Cizeau, J.-P. Bouchaud, and M. Potters, Physical Review Letters 83, 1467 (1999).
[15] V. Plerou, P. Gopikrishnan, B. Rosenow, L. A. N. Amaral, and H. E. Stanley, Physical Review Letters 83, 1471 (1999).
[16] Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Vol. 20 (Springer, 2010).
[17] J. Bun, R. Allez, J.-P. Bouchaud, and M. Potters, IEEE Transactions on Information Theory 62, 7475 (2016).
[18] A. A. Lee, M. P. Brenner, and L. J. Colwell, Proceedings of the National Academy of Sciences 113, 13564 (2016).
[19] J. Bun, J.-P. Bouchaud, and M. Potters, Physics Reports 666, 1 (2017).
[20] J. S. Delaney, Journal of Chemical Information and Computer Sciences 44, 1000 (2004).
[21] R. Skyner, J. McDonagh, C. Groom, T. van Mourik, and J. Mitchell, Physical Chemistry Chemical Physics 17, 6174 (2015).
[22] P. Gedeck, B. Rohde, and C. Bartels, Journal of Chemical Information and Modeling 46, 1924 (2006).
[23] J. L. Durant, B. A. Leland, D. R. Henry, and J. G. Nourse, Journal of Chemical Information and Computer Sciences 42, 1273 (2002).
[24] D. Rogers and M. Hahn, Journal of Chemical Information and Modeling 50, 742 (2010).
[25] "RDKit: Open-source cheminformatics," http://www.rdkit.org.
[26] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, in Advances in Neural Information Processing Systems (2015) pp. 2224-2232.
[27] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, Journal of Computer-Aided Molecular Design 30, 595 (2016).
[28] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, Nature 440, 1007 (2006).
[29] S. Cocco, S. Leibler, and R. Monasson, Proceedings of the National Academy of Sciences 106, 14058 (2009).
[30] E. D. Lee, C. P. Broedersz, and W. Bialek, Journal of Statistical Physics 160, 275 (2015).
[31] J. J. Hopfield, Proceedings of the National Academy of Sciences 79, 2554 (1982).
[32] V. A. Marčenko and L. A. Pastur, Mathematics of the USSR-Sbornik 1, 457 (1967).
[33] J. P. Doye, D. J. Wales, and M. A. Miller, The Journal of Chemical Physics 109, 8143 (1998).
[34] J. P. Doye and C. P. Massen, The Journal of Chemical Physics 122, 084105 (2005).
[35] C. J. Pickard and R. Needs, Journal of Physics: Condensed Matter 23, 053201 (2011).
[36] D. J. Wales, M. A. Miller, and T. R. Walsh, Nature 394, 758 (1998).
[37] D. Wales, Energy Landscapes: Applications to Clusters, Biomolecules and Glasses (Cambridge University Press, 2003).