Applied Biclustering Methods for Big and High-Dimensional Data Using R

Edited by
Adetayo Kasim, Durham University, United Kingdom
Ziv Shkedy, Hasselt University, Diepenbeek, Belgium
Sebastian Kaiser, Ludwig Maximilian Universität, Munich, Germany
Sepp Hochreiter, Johannes Kepler University, Linz, Austria
Willem Talloen, Janssen Pharmaceuticals, Beerse, Belgium

Editor-in-Chief
Shein-Chung Chow, Ph.D., Professor, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina

Series Editors
Byron Jones, Biometrical Fellow, Statistical Methodology, Integrated Information Sciences, Novartis Pharma AG, Basel, Switzerland
Jen-pei Liu, Professor, Division of Biometry, Department of Agronomy, National Taiwan University, Taipei, Taiwan
Karl E. Peace, Georgia Cancer Coalition, Distinguished Cancer Scholar, Senior Research Scientist and Professor of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, Georgia
Bruce W. Turnbull, Professor, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, New York

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S.
Government works

Printed on acid-free paper
Version Date: 20160406

International Standard Book Number-13: 978-1-4822-0823-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Kasim, Adetayo, editor.
Title: Applied biclustering methods for big and high dimensional data using R / editors, Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter and Willem Talloen.
Description: Boca Raton : Taylor & Francis, CRC Press, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016003221 | ISBN 9781482208238 (alk. paper)
Subjects: LCSH: Big data. | Cluster set theory. | R (Computer program language)
Classification: LCC QA76.9.B45 A67 2016 | DDC 005.7--dc23
LC record available at https://lccn.loc.gov/2016003221

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
Contributors
R Packages and Products

1 Introduction
Ziv Shkedy, Adetayo Kasim, Sepp Hochreiter, Sebastian Kaiser and Willem Talloen
  1.1 From Clustering to Biclustering
  1.2 We R a Community
  1.3 Biclustering for Cloud Computing
  1.4 Book Structure
  1.5 Datasets
    1.5.1 Dutch Breast Cancer Data
    1.5.2 Diffuse Large B-Cell Lymphoma (DLBCL)
    1.5.3 Multiple Tissue Types Data
    1.5.4 CMap Dataset
    1.5.5 NCI60 Panel
    1.5.6 1000 Genomes Project
    1.5.7 Tourism Survey Data
    1.5.8 Toxicogenomics Project
    1.5.9 Yeast Data
    1.5.10 mglu2 Project
    1.5.11 TCGA Data
    1.5.12 NBA Data
    1.5.13 Colon Cancer Data
2 From Cluster Analysis to Biclustering
Dhammika Amaratunga, Javier Cabrera, Nolen Joy Perualila, Adetayo Kasim and Ziv Shkedy
  2.1 Cluster Analysis
    2.1.1 An Introduction
    2.1.2 Dissimilarity Measures and Similarity Measures
      2.1.2.1 Example 1: Clustering Compounds in the CMAP Data Based on Chemical Similarity
      2.1.2.2 Example 2
    2.1.3 Hierarchical Clustering
      2.1.3.1 Example 1
      2.1.3.2 Example 2
    2.1.4 ABC Dissimilarity for High-Dimensional Data
  2.2 Biclustering: A Graphical Tour
    2.2.1 Global versus Local Patterns
    2.2.2 Bicluster’s Type
    2.2.3 Bicluster’s Configuration

I Biclustering Methods

3 δ-Biclustering and FLOC Algorithm
Adetayo Kasim, Sepp Hochreiter and Ziv Shkedy
  3.1 Introduction
  3.2 δ-Biclustering
    3.2.1 Single-Node Deletion Algorithm
    3.2.2 Multiple-Node Deletion Algorithm
    3.2.3 Node Addition Algorithm
    3.2.4 Application to Yeast Data
  3.3 FLOC
    3.3.1 FLOC Phase I
    3.3.2 FLOC Phase II
    3.3.3 FLOC Application to Yeast Data
  3.4 Discussion

4 The xMotif algorithm
Ewoud De Troyer, Dan Lin, Ziv Shkedy and Sebastian Kaiser
  4.1 Introduction
  4.2 xMotif Algorithm
    4.2.1 Setting
    4.2.2 Search Algorithm
  4.3 Biclustering with xMotif
    4.3.1 Test Data
    4.3.2 Discretisation and Parameter Settings
      4.3.2.1 Discretisation
      4.3.2.2 Parameters Setting
    4.3.3 Using the biclust Package
  4.4 Discussion

5 Bimax Algorithm
Ewoud De Troyer, Suzy Van Sanden, Ziv Shkedy and Sebastian Kaiser
  5.1 Introduction
  5.2 Bimax Algorithm
    5.2.1 Setting
    5.2.2 Search Algorithm
  5.3 Biclustering with Bimax
    5.3.1 Test Data
    5.3.2 Biclustering Using the Bimax Method
    5.3.3 Influence of the Parameters Setting
  5.4 Discussion

6 The Plaid Model
Ziv Shkedy, Ewoud De Troyer, Adetayo Kasim, Sepp Hochreiter and Heather Turner
  6.1 Plaid Model
    6.1.1 Setting
    6.1.2 Overlapping Biclusters
    6.1.3 Estimation
    6.1.4 Search Algorithm
  6.2 Implementation in R
    6.2.1 Constant Biclusters
    6.2.2 Misclassification of the Mean Structure
  6.3 Plaid Model in BiclustGUI
  6.4 Mean Structure of a Bicluster
  6.5 Discussion

7 Spectral Biclustering
Adetayo Kasim, Setia Pramana and Ziv Shkedy
  7.1 Introduction
  7.2 Normalisation
    7.2.1 Independent Rescaling of Rows and Columns (IRRC)
    7.2.2 Bistochastisation
    7.2.3 Log Interactions
  7.3 Spectral Biclustering
  7.4 Spectral Biclustering Using the biclust Package
    7.4.1 Application to DLBCL Dataset
    7.4.2 Analysis of a Test Data
  7.5 Discussion

8 FABIA
Sepp Hochreiter
  8.1 FABIA Model
    8.1.1 The Idea
    8.1.2 Model Formulation
    8.1.3 Parameter Estimation
    8.1.4 Bicluster Extraction
  8.2 Implementation in R
  8.3 Case Studies
    8.3.1 Breast Cancer Data
    8.3.2 Multiple Tissues Data
    8.3.3 Diffuse Large B-Cell Lymphoma (DLBCL) Data
  8.4 Discussion

9 Iterative Signature Algorithm
Adetayo Kasim and Ziv Shkedy
  9.1 Introduction: Bicluster Definition
  9.2 Iterative Signature Algorithm
  9.3 Biclustering Using ISA
    9.3.1 isa2 Package
    9.3.2 Application to Breast Data
    9.3.3 Application to the DLBCL Data
  9.4 Discussion

10 Ensemble Methods and Robust Solutions
Tatsiana Khamiakova, Sebastian Kaiser and Ziv Shkedy
  10.1 Introduction
  10.2 Motivating Example (I)
  10.3 Ensemble Method
    10.3.1 Initialization Step
    10.3.2 Combination Step
      10.3.2.1 Similarity Indices
      10.3.2.2 Correlation Approach
      10.3.2.3 Hierarchical Clustering
      10.3.2.4 Quality Clustering
    10.3.3 Merging Step
  10.4 Application of Ensemble Biclustering for the Breast Cancer Data Using superbiclust Package
    10.4.1 Robust Analysis for the Plaid Model
    10.4.2 Robust Analysis of ISA
    10.4.3 FABIA: Overlap between Biclusters
    10.4.4 Biclustering Analysis Combining Several Methods
  10.5 Application of Ensemble Biclustering to the TCGA Data Using biclust Implementation
    10.5.1 Motivating Example (II)
    10.5.2 Correlation Approach
    10.5.3 Jaccard Index Approach
    10.5.4 Comparison between Jaccard Index and the Correlation Approach
    10.5.5 Implementation in R
  10.6 Discussion

II Case Studies and Applications