
Algorithms in Bioinformatics: Second International Workshop, WABI 2002, Rome, Italy, September 17–21, 2002, Proceedings

543 Pages·2002·5.331 MB·English


Preface

We are pleased to present the proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI 2002), which took place on September 17–21, 2002 in Rome, Italy. The WABI workshop was part of a three-conference meeting, which, in addition to WABI, included the ESA and APPROX 2002. The three conferences are jointly called ALGO 2002, and were hosted by the Faculty of Engineering, University of Rome “La Sapienza”. See http://www.dis.uniroma1.it/~algo02 for more details.

The Workshop on Algorithms in Bioinformatics covers research in all areas of algorithmic work in bioinformatics and computational biology. The emphasis is on discrete algorithms that address important problems in molecular biology, genomics, and genetics, that are founded on sound models, that are computationally efficient, and that have been implemented and tested in simulations and on real datasets. The goal is to present recent research results, including significant work in progress, and to identify and explore directions of future research.

Original research papers (including significant work in progress) or state-of-the-art surveys were solicited on all aspects of algorithms in bioinformatics, including, but not limited to: exact and approximate algorithms for genomics, genetics, sequence analysis, gene and signal recognition, alignment, molecular evolution, phylogenetics, structure determination or prediction, gene expression and gene networks, proteomics, functional genomics, and drug design.

We received 83 submissions in response to our call for papers, and were able to accept about half of the submissions. In addition, WABI hosted two invited, distinguished lectures, given to the entire ALGO 2002 conference, by Dr. Ehud Shapiro of the Weizmann Institute and Dr. Gene Myers of Celera Genomics. An abstract of Dr. Shapiro’s lecture, and a full paper detailing Dr. Myers’ lecture, are included in these proceedings.

We would like to sincerely thank all the authors of submitted papers, and the participants of the workshop. We also thank the program committee for their hard work in reviewing and selecting the papers for the workshop. We were fortunate to have on the program committee the following distinguished group of researchers:

Pankaj Agarwal (GlaxoSmithKline Pharmaceuticals, King of Prussia)
Alberto Apostolico (Università di Padova and Purdue University, Lafayette)
Craig Benham (University of California, Davis)
Jean-Michel Claverie (CNRS-AVENTIS, Marseille)
Nir Friedman (Hebrew University, Jerusalem)
Olivier Gascuel (Université de Montpellier II and CNRS, Montpellier)
Misha Gelfand (Integrated Genomics, Moscow)
Raffaele Giancarlo (Università di Palermo)
David Gilbert (University of Glasgow)
Roderic Guigó (Institut Municipal d’Investigacions Mèdiques, Barcelona, co-chair)
Dan Gusfield (University of California, Davis, co-chair)
Jotun Hein (University of Oxford)
Inge Jonassen (Universitetet i Bergen)
Giuseppe Lancia (Università di Padova)
Bernard M.E. Moret (University of New Mexico, Albuquerque)
Gene Myers (Celera Genomics, Rockville)
Christos Ouzonis (European Bioinformatics Institute, Hinxton Hall)
Lior Pachter (University of California, Berkeley)
Knut Reinert (Celera Genomics, Rockville)
Marie-France Sagot (Université Claude Bernard, Lyon)
David Sankoff (Université de Montréal)
Steve Skiena (State University of New York, Stony Brook)
Gary Stormo (Washington University, St.
Louis)
Jens Stoye (Universität Bielefeld)
Martin Tompa (University of Washington, Seattle)
Alfonso Valencia (Centro Nacional de Biotecnología, Madrid)
Martin Vingron (Max-Planck-Institut für Molekulare Genetik, Berlin)
Lusheng Wang (City University of Hong Kong)
Tandy Warnow (University of Texas, Austin)

We also would like to thank the WABI steering committee, Olivier Gascuel, Jotun Hein, Raffaele Giancarlo, Erik Meineche-Schmidt, and Bernard Moret, for inviting us to co-chair this program committee, and for their help in carrying out that task.

We are particularly indebted to Terri Knight of the University of California, Davis, Robert Castelo of the Universitat Pompeu Fabra, Barcelona, and Bernard Moret of the University of New Mexico, Albuquerque, for the extensive technical and advisory help they gave us. We could not have managed the reviewing process and the preparation of the proceedings without their help and advice.

Thanks again to everyone who helped to make WABI 2002 a success. We hope to see everyone again at WABI 2003.

July, 2002
Roderic Guigó and Dan Gusfield

Table of Contents

Simultaneous Relevant Feature Identification and Classification in High-Dimensional Spaces
L.R. Grate (Lawrence Berkeley National Laboratory), C. Bhattacharyya, M.I. Jordan, and I.S. Mian (University of California Berkeley)

Pooled Genomic Indexing (PGI): Mathematical Analysis and Experiment Design
M. Csűrös (Université de Montréal), and A. Milosavljevic (Human Genome Sequencing Center)

Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem
R. Rizzi (Università di Trento), V. Bafna, S. Istrail (Celera Genomics), and G. Lancia (Università di Padova)

Methods for Inferring Block-Wise Ancestral History from Haploid Sequences
R. Schwartz, A.G. Clark (Celera Genomics), and S.
Istrail (Celera Genomics)

Finding Signal Peptides in Human Protein Sequences Using Recurrent Neural Networks
M. Reczko (Synaptic Ltd.), P. Fiziev, E. Staub (metaGen Pharmaceuticals GmbH), and A. Hatzigeorgiou (University of Pennsylvania)

Generating Peptide Candidates from Amino-Acid Sequence Databases for Protein Identification via Mass Spectrometry
N. Edwards and R. Lippert (Celera Genomics)

Improved Approximation Algorithms for NMR Spectral Peak Assignment
Z.-Z. Chen (Tokyo Denki University), T. Jiang (University of California, Riverside), G. Lin (University of Alberta), J. Wen (University of California, Riverside), D. Xu, and Y. Xu (Oak Ridge National Laboratory)

Efficient Methods for Inferring Tandem Duplication History
L. Zhang (Nat. University of Singapore), B. Ma (University of Western Ontario), and L. Wang (City University of Hong Kong)

Genome Rearrangement Phylogeny Using Weighbor
L.-S. Wang (University of Texas at Austin)

Segment Match Refinement and Applications
A.L. Halpern (Celera Genomics), D.H. Huson (Tübingen University), and K. Reinert (Celera Genomics)

Extracting Common Motifs under the Levenshtein Measure: Theory and Experimentation
E.F. Adebiyi and M. Kaufmann (Universität Tübingen)

Fast Algorithms for Finding Maximum-Density Segments of a Sequence with Applications to Bioinformatics
M.H. Goldwasser (Loyola University Chicago), M.-Y. Kao (Northwestern University), and H.-I. Lu (Academia Sinica)

FAUST: An Algorithm for Extracting Functionally Relevant Templates from Protein Structures
M. Milik, S. Szalma, and K.A. Olszewski (Accelrys)

Efficient Unbound Docking of Rigid Molecules
D. Duhovny, R. Nussinov, and H.J.
Wolfson (Tel Aviv University)

A Method of Consolidating and Combining EST and mRNA Alignments to a Genome to Enumerate Supported Splice Variants
R. Wheeler (Affymetrix)

A Method to Improve the Performance of Translation Start Site Detection and Its Application for Gene Finding
M. Pertea and S.L. Salzberg (The Institute for Genomic Research)

Comparative Methods for Gene Structure Prediction in Homologous Sequences
C.N.S. Pedersen and T. Scharling (University of Aarhus)

MultiProt – A Multiple Protein Structural Alignment Algorithm
M. Shatsky, R. Nussinov, and H.J. Wolfson (Tel Aviv University)

A Hybrid Scoring Function for Protein Multiple Alignment
E. Rocke (University of Washington)

Functional Consequences in Metabolic Pathways from Phylogenetic Profiles
Y. Bilu and M. Linial (Hebrew University)

Finding Founder Sequences from a Set of Recombinants
E. Ukkonen (University of Helsinki)

Estimating the Deviation from a Molecular Clock
L. Nakhleh, U. Roshan (University of Texas at Austin), L. Vawter (Aventis Pharmaceuticals), and T. Warnow (University of Texas at Austin)

Exploring the Set of All Minimal Sequences of Reversals – An Application to Test the Replication-Directed Reversal Hypothesis
Y. Ajana, J.-F. Lefebvre (Université de Montréal), E.R.M. Tillier (University Health Network), and N. El-Mabrouk (Université de Montréal)

Approximating the Expected Number of Inversions Given the Number of Breakpoints
N. Eriksen (Royal Institute of Technology)

Invited Lecture – Accelerating Smith-Waterman Searches
G. Myers (Celera Genomics) and R. Durbin (Sanger Centre)

Sequence-Length Requirements for Phylogenetic Methods
B.M.E. Moret (University of New Mexico), U. Roshan, and T. Warnow (University of Texas at Austin)

Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle
R. Desper (National Library of Medicine, NIH) and O. Gascuel (LIRMM)

NeighborNet: An Agglomerative Method for the Construction of Planar Phylogenetic Networks
D. Bryant (McGill University) and V. Moulton (Uppsala University)

On the Control of Hybridization Noise in DNA Sequencing-by-Hybridization
H.-W. Leong (National University of Singapore), F.P. Preparata (Brown University), W.-K. Sung, and H. Willy (National University of Singapore)

Restricting SBH Ambiguity via Restriction Enzymes
S. Skiena (SUNY Stony Brook) and S. Snir (Technion)

Invited Lecture – Molecule as Computation: Towards an Abstraction of Biomolecular Systems
E. Shapiro (Weizmann Institute)

Fast Optimal Genome Tiling with Applications to Microarray Design and Homology Search
P. Berman (Pennsylvania State University), P. Bertone (Yale University), B. DasGupta (University of Illinois at Chicago), M. Gerstein (Yale University), M.-Y. Kao (Northwestern University), and M. Snyder (Yale University)

Rapid Large-Scale Oligonucleotide Selection for Microarrays
S. Rahmann (Max-Planck-Institute for Molecular Genetics)

Border Length Minimization in DNA Array Design
A.B. Kahng, I.I. Măndoiu, P.A. Pevzner, S. Reda (University of California at San Diego), and A.Z. Zelikovsky (Georgia State University)

The Enhanced Suffix Array and Its Applications to Genome Analysis
M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch (University of Bielefeld)

The Algorithmic of Gene Teams
A. Bergeron (Université du Québec à Montréal), S.
Corteel (CNRS - Université de Versailles), and M. Raffinot (CNRS - Laboratoire Génome et Informatique)

Combinatorial Use of Short Probes for Differential Gene Expression Profiling
L.L. Warren and B.H. Liu (North Carolina State University)

Designing Specific Oligonucleotide Probes for the Entire S. cerevisiae Transcriptome
D. Lipson (Technion), P. Webb, and Z. Yakhini (Agilent Laboratories)

K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data
Z. Bar-Joseph, E.D. Demaine, D.K. Gifford (MIT LCS), A.M. Hamel (Wilfrid Laurier University), T.S. Jaakkola (MIT AI Lab), and N. Srebro (MIT LCS)

Inversion Medians Outperform Breakpoint Medians in Phylogeny Reconstruction from Gene-Order Data
B.M.E. Moret (University of New Mexico), A.C. Siepel (University of California at Santa Cruz), J. Tang, and T. Liu (University of New Mexico)

Modified Mincut Supertrees
R.D.M. Page (University of Glasgow)

Author Index

Simultaneous Relevant Feature Identification and Classification in High-Dimensional Spaces

L.R. Grate¹, C. Bhattacharyya²,³, M.I. Jordan²,³, and I.S. Mian¹

¹ Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley CA 94720
² Department of EECS, University of California Berkeley, Berkeley CA 94720
³ Department of Statistics, University of California Berkeley, Berkeley CA 94720

Abstract. Molecular profiling technologies monitor thousands of transcripts, proteins, metabolites or other species concurrently in biological samples of interest. Given two-class, high-dimensional profiling data, nominal Liknon [4] is a specific implementation of a methodology for performing simultaneous relevant feature identification and classification. It exploits the well-known property that minimizing an $l_1$ norm (via linear programming) yields a sparse hyperplane [15,26,2,8,17].
This work (i) examines computational, software and practical issues required to realize nominal Liknon, (ii) summarizes results from its application to five real world data sets, (iii) outlines heuristic solutions to problems posed by domain experts when interpreting the results and (iv) defines some future directions of the research.

R. Guigó and D. Gusfield (Eds.): WABI 2002, LNCS 2452, pp. 1–9, 2002. © Springer-Verlag Berlin Heidelberg 2002

1 Introduction

Biologists and clinicians are adopting high-throughput genomics, proteomics and related technologies to assist in interrogating normal and perturbed systems such as unaffected and tumor tissue specimens. Such investigations can generate data having the form $D = \{(x_n, y_n),\ n \in (1, \ldots, N)\}$ where $x_n \in \mathbb{R}^P$ and, for two-class data, $y_n \in \{+1, -1\}$. Each element of a data point $x_n$ is the absolute or relative abundance of a molecular species monitored. In transcript profiling, a data point represents transcript (gene) levels measured in a sample using cDNA, oligonucleotide or similar microarray technology. A data point from protein profiling can represent Mass/Charge (M/Z) values for low molecular weight molecules (proteins) measured in a sample using mass spectroscopy.

In cancer biology, profiling studies of different types of (tissue) specimens are motivated largely by a desire to create clinical decision support systems for accurate tumor classification and to identify robust and reliable targets, “biomarkers”, for imaging, diagnosis, prognosis and therapeutic intervention [14,3,13,27,18,23,9,25,28,19,21,24]. Meeting these biological challenges includes addressing the general statistical problems of classification and prediction, and relevant feature identification.

Support Vector Machines (SVMs) [30,8] have been employed successfully for cancer classification based on transcript profiles [5,22,25,28]. Although mechanisms for reducing the number of features to more manageable numbers include
discarding those below a user-defined threshold, relevant feature identification is usually addressed via a filter-wrapper strategy [12,22,32]. The filter generates candidate feature subsets whilst the wrapper runs an induction algorithm to determine the discriminative ability of a subset. Although SVMs and the newly formulated Minimax Probability Machine (MPM) [20] are good wrappers [4], the choice of filtering statistic remains an open question.

Nominal Liknon is a specific implementation of a strategy for performing simultaneous relevant feature identification and classification [4]. It exploits the well-known property that minimizing an $l_1$ norm (via linear programming) yields a sparse hyperplane [15,26,2,8,17]. The hyperplane constitutes the classifier whilst its sparsity, a weight vector with few non-zero elements, defines a small number of relevant features. Nominal Liknon is computationally less demanding than the prevailing filter–(SVM/MPM) wrapper strategy which treats the problems of feature selection and classification as two independent tasks [4,16]. Biologically, nominal Liknon performs well when applied to real world data generated not only by the ubiquitous transcript profiling technology, but also by the emergent protein profiling technology.

2 Simultaneous Relevant Feature Identification and Classification

Consider a data set $D = \{(x_n, y_n),\ n \in (1, \ldots, N)\}$. Each of the $N$ data points (profiling experiments) is a $P$-dimensional vector of features (gene or protein abundances) $x_n \in \mathbb{R}^P$ (usually $N \sim 10^1$–$10^2$; $P \sim 10^3$–$10^4$). A data point $n$ is assigned to one of two classes $y_n \in \{+1, -1\}$, such as a normal or tumor tissue sample. Given such two-class high-dimensional data, the analytical goal is to estimate a sparse classifier, a model which distinguishes the two classes of data points (classification) and specifies a small subset of discriminatory features (relevant feature identification). Assume that the data $D$ can be separated by a linear hyperplane in the $P$-dimensional input feature space. The learning task can be formulated as an attempt to estimate a hyperplane, parameterized in terms of a weight vector $w$ and bias $b$, via a solution to the following $N$ inequalities [30]:

$$y_n z_n = y_n (w^T x_n - b) \ge 0 \quad \forall n = \{1, \ldots, N\}. \tag{1}$$

The hyperplane satisfying $w^T x - b = 0$ is termed a classifier. A new data point $x$ (abundances of $P$ features in a new sample) is classified by computing $z = w^T x - b$. If $z > 0$, the data point is assigned to one class; otherwise it belongs to the other class.

Enumerating relevant features at the same time as discovering a classifier can be addressed by finding a sparse hyperplane, a weight vector $w$ in which most components are equal to zero. The rationale is that zero elements do not contribute to determining the value of $z$:

$$z = \sum_{p=1}^{P} w_p x_p - b.$$

If $w_p = 0$, feature $p$ is “irrelevant” with regards to deciding the class. Since only non-zero elements $w_p \ne 0$ influence the value of $z$, they can be regarded as “relevant” features.

The task of defining a small number of relevant features can be equated with that of finding a small set of non-zero elements. This can be formulated as an optimization problem; namely that of minimizing the $l_0$ norm $\|w\|_0$, where $\|w\|_0 = \text{number of } \{p : w_p \ne 0\}$, the number of non-zero elements of $w$. Thus we obtain:

$$\min_{w,b} \|w\|_0 \quad \text{subject to} \quad y_n (w^T x_n - b) \ge 0 \quad \forall n = \{1, \ldots, N\}. \tag{2}$$

Unfortunately, problem (2) is NP-hard [10]. A tractable, convex approximation to this problem can be obtained by replacing the $l_0$ norm with the $l_1$ norm $\|w\|_1$, where $\|w\|_1 = \sum_{p=1}^{P} |w_p|$, the sum of the absolute magnitudes of the elements of a vector [10]:

$$\min_{w,b} \|w\|_1 = \sum_{p=1}^{P} |w_p| \quad \text{subject to} \quad y_n (w^T x_n - b) \ge 0 \quad \forall n = \{1, \ldots, N\}. \tag{3}$$

A solution to (3) yields the desired sparse weight vector $w$.

Optimization problem (3) can be solved via linear programming [11]. The ensuing formulation requires the imposition of constraints on the allowed ranges of variables. The introduction of new variables $u, v \in \mathbb{R}^P$ such that $|w_p| = u_p + v_p$ and $w_p = u_p - v_p$ ensures non-negativity. The range of $w_p = u_p - v_p$ is unconstrained (positive or negative) whilst $u_p$ and $v_p$ remain non-negative. $u_p$ and $v_p$ are designated the “positive” and “negative” parts respectively. Similarly, the bias $b$ is split into positive and negative components $b = b^+ - b^-$. Given a solution to problem (3), either $u_p$ or $v_p$ will be non-zero for feature $p$ [11]:

$$\begin{aligned}
\min_{u,v,b^+,b^-} \quad & \sum_{p=1}^{P} (u_p + v_p) \\
\text{subject to} \quad & y_n \big((u - v)^T x_n - (b^+ - b^-)\big) \ge 1 \\
& u_p \ge 0;\ v_p \ge 0;\ b^+ \ge 0;\ b^- \ge 0 \\
& \forall n = \{1, \ldots, N\};\ \forall p = \{1, \ldots, P\}.
\end{aligned} \tag{4}$$

A detailed description of the origins of the $\ge 1$ constraint can be found elsewhere [30].

If the data $D$ are not linearly separable, misclassifications (errors in the class labels $y_n$) can be accounted for by the introduction of slack variables $\xi_n$. Problem (4) can be recast, yielding the final optimization problem,

$$\begin{aligned}
\min_{u,v,b^+,b^-} \quad & \sum_{p=1}^{P} (u_p + v_p) + C \sum_{n=1}^{N} \xi_n \\
\text{subject to} \quad & y_n \big((u - v)^T x_n - (b^+ - b^-)\big) \ge 1 - \xi_n \\
& u_p \ge 0;\ v_p \ge 0;\ b^+ \ge 0;\ b^- \ge 0;\ \xi_n \ge 0 \\
& \forall n = \{1, \ldots, N\};\ \forall p = \{1, \ldots, P\}.
\end{aligned} \tag{5}$$

$C$ is an adjustable parameter weighing the contribution of misclassified data points. Larger values lead to fewer misclassifications being ignored: $C = 0$ corresponds to all outliers being ignored whereas $C \to \infty$ leads to the hard margin limit.

3 Computational, Software and Practical Issues

Learning the sparse classifier defined by optimization problem (5) involves minimizing a linear function subject to linear constraints.
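
As an illustration only, optimization problem (5) can be written down for a general-purpose LP solver. The sketch below is not the authors' Matlab/perl/lp_solve code: it assumes NumPy and SciPy's `linprog` with the HiGHS backend, and the function name `fit_sparse_hyperplane` and the variable ordering (u, v, b+, b−, ξ) are hypothetical choices made here for clarity.

```python
import numpy as np
from scipy.optimize import linprog


def fit_sparse_hyperplane(X, y, C=1.0):
    """Sketch of problem (5): l1-norm minimization with slack variables.

    X is an N x P array of feature values, y holds labels in {+1, -1}.
    Returns (w, b, xi). This is an illustrative stand-in, not Liknon itself.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, P = X.shape

    # Assumed variable order: u (P), v (P), b_plus, b_minus, xi (N).
    cost = np.concatenate([np.ones(2 * P), [0.0, 0.0], C * np.ones(N)])

    # y_n((u - v)^T x_n - (b+ - b-)) >= 1 - xi_n, rewritten as A_ub @ z <= b_ub:
    # -y_n x_n^T u + y_n x_n^T v + y_n b+ - y_n b- - xi_n <= -1
    yX = y[:, None] * X
    A_ub = np.hstack([-yX, yX, y[:, None], -y[:, None], -np.eye(N)])
    b_ub = -np.ones(N)

    # A single (0, None) bound makes every variable non-negative, as in (5).
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    u, v = res.x[:P], res.x[P:2 * P]
    w = u - v                              # weight vector, w_p = u_p - v_p
    b = res.x[2 * P] - res.x[2 * P + 1]    # bias, b = b+ - b-
    xi = res.x[2 * P + 2:]                 # slack variables
    return w, b, xi
```

Under these assumptions, the features with non-zero entries of `w` would be read off as the "relevant" ones, and a new sample `x` would be classified by the sign of `x @ w - b`.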
Efficient algorithms for solving such linear programming problems involving ∼10,000 variables (N) and ∼10,000 constraints (P) are well-known. Standalone open source codes include lp_solve¹ and PCx².

Nominal Liknon is an implementation of the sparse classifier (5). It incorporates routines written in Matlab³ and a system utilizing perl⁴ and lp_solve. The code is available from the authors upon request. The input consists of a file containing an $N \times (P + 1)$ data matrix in which each row represents a single profiling experiment. The first $P$ columns are the feature values, abundances of molecular species, whilst column $P + 1$ is the class label $y_n \in \{+1, -1\}$. The output comprises the non-zero values of the weight vector $w$ (relevant features), the bias $b$ and the number of non-zero slack variables $\xi_n$.

The adjustable parameter $C$ in problem (5) can be set using cross validation techniques. The results described here were obtained by choosing $C = 0.5$ or $C = 1$.

4 Application of Nominal Liknon to Real World Data

Nominal Liknon was applied to five data sets in the size range ($N = 19$, $P = 1,987$) to ($N = 200$, $P = 15,154$). A data set $D$ yielded a sparse classifier, $w$ and $b$, and a specification of the $l$ relevant features ($P \gg l$). Since the profiling studies produced only a small number of data points ($N \ll P$), the generalization error of a nominal Liknon classifier was determined by computing the leave-one-out error for $l$-dimensional data points. A classifier trained using $N - 1$ data points was used to predict the class of the withheld data point; the procedure repeated $N$ times. The results are shown in Table 1.

Nominal Liknon performs well in terms of simultaneous relevant feature identification and classification. In all five transcript and protein profiling data

¹ http://www.netlib.org/ampl/solvers/lpsolve/
² http://www-fp.mcs.anl.gov/otc/Tools/PCx/
³ http://www.mathworks.com
⁴ http://www.perl.org/
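
The leave-one-out procedure described in Section 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `split_features_labels`, `centroid_fit` and `leave_one_out_error` are hypothetical, and `centroid_fit` is a simple nearest-centroid stand-in for the trainer (it is not Liknon); any function returning a weight vector and bias could be passed in its place.

```python
import numpy as np


def split_features_labels(matrix):
    """Split an N x (P+1) data matrix into features (first P columns)
    and class labels in {+1, -1} (last column)."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix[:, :-1], matrix[:, -1]


def centroid_fit(X, y):
    """Placeholder trainer (assumption, NOT Liknon): hyperplane midway
    between the two class means. Requires both classes to be present."""
    mu_pos = X[y > 0].mean(axis=0)
    mu_neg = X[y < 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = w @ (mu_pos + mu_neg) / 2.0
    return w, b


def leave_one_out_error(X, y, fit=centroid_fit):
    """Train on N-1 points, predict the withheld point, repeat N times,
    and return the fraction of withheld points that were misclassified."""
    N = len(y)
    errors = 0
    for n in range(N):
        keep = np.arange(N) != n           # mask that withholds point n
        w, b = fit(X[keep], y[keep])
        z = X[n] @ w - b                   # z = w^T x - b
        predicted = 1.0 if z > 0 else -1.0
        errors += int(predicted != y[n])
    return errors / N
```

In the setting of the text, `X` would first be restricted to the columns of the relevant features, so that the error is computed for the reduced-dimension data points.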
