Big Data Analytics in Oncology with R

Big Data Analytics in Oncology with R presents analytical approaches for big data analysis. Computation with R has advanced considerably, but working with big data still poses several technical challenges. These challenges concern the computational burden and how to obtain results as quickly as possible. Clinical decisions based on genomic information and survival outcomes are now unavoidable in cutting-edge oncology research. This book is intended to provide a comprehensive text covering some recent developments in the area.

Features
• Covers gene expression data analysis using R and survival analysis using R
• Includes Bayesian methods in survival-gene expression analysis
• Discusses competing-gene expression analysis using R
• Covers Bayesian methods for survival analysis with omics data

This book is aimed primarily at graduates and researchers studying survival analysis or statistical methods in genetics.

Big Data Analytics in Oncology with R
Atanu Bhattacharjee

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Atanu Bhattacharjee

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data
Names: Bhattacharjee, Atanu, author.
Title: Big data analytics in oncology with R / Atanu Bhattacharjee.
Description: First edition. | Boca Raton : Chapman & Hall/CRC Press, 2023. | Includes bibliographical references and index. | Summary: “Big Data Analytics in Oncology with R serves the analytical approaches for big data analysis. There is huge progressed in advanced computation with R. But there are several technical challenges faced to work with big data. These challenges are with computational aspect and work with fastest way to get computational results. Clinical decision through genomic information and survival outcomes are now unavoidable in cutting-edge oncology research. This book is intended to provide a comprehensive text to work with some recent development in the area”-- Provided by publisher.
Identifiers: LCCN 2022034791 (print) | LCCN 2022034792 (ebook) | ISBN 9781032028767 (hardback) | ISBN 9781032028774 (paperback) | ISBN 9781003185598 (ebook)
Subjects: MESH: Medical Oncology--statistics & numerical data | Algorithms | Big Data | Data Interpretation, Statistical | Programming Languages
Classification: LCC RC263 (print) | LCC RC263 (ebook) | NLM QZ 26.5 | DDC 616.99/4--dc23/eng/20221115
LC record available at https://lccn.loc.gov/2022034791
LC ebook record available at https://lccn.loc.gov/2022034792

ISBN: 978-1-032-02876-7 (hbk)
ISBN: 978-1-032-02877-4 (pbk)
ISBN: 978-1-003-18559-8 (ebk)

DOI: 10.1201/9781003185598

Typeset in CMR10 by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

I dedicated this book to my family Chaitry, Atrideep, Ma and Baba.

Contents

Preface
Author

1 Survival Analysis
  1.1 Introduction
  1.2 Hazard Function
  1.3 Censoring
  1.4 Study Design and Survival Analysis
  1.5 Survival Analysis Objective
  1.6 Non-Parametric Approach for Survival Analysis
  1.7 Log-Rank Test
  1.8 Median Follow-Up Time Calculation
  1.9 Survival Data
    1.9.1 Multiple event-time data
    1.9.2 Multivariate survival data
    1.9.3 Univariate survival models
    1.9.4 Multivariate survival models
    1.9.5 Doubly interval-censored survival data
    1.9.6 Frequentist approach
  1.10 Bayesian Prior Assumptions for Survival Analysis
    1.10.1 Prior in survival analysis
    1.10.2 Dirichlet process prior
  1.11 Illustration Using R

2 Cox Proportional Survival Analysis
  2.1 Introduction
  2.2 Cox Proportional Hazard
    2.2.1 Hazard ratio
    2.2.2 Partial likelihood function
    2.2.3 Wald score and Likelihood ratio tests
  2.3 Cox Proportional Diagnostics Test
    2.3.1 Cox-Snell residual
    2.3.2 Martingale residual
  2.4 Mean and Median Survival Time
  2.5 Stratified Cox Proportional Hazard Test
  2.6 Schoenfeld Residuals
  2.7 Extended Cox Regression Model
  2.8 Illustration Using R
    2.8.1 Univariate Cox proportional hazard in high dimensional data
    2.8.2 Expectation-maximization algorithm

3 Parametric Survival Analysis
  3.1 Introduction
  3.2 Regularized Survival Analysis
  3.3 Gaussian Prior and Ridge Regression
  3.4 Laplacian Prior and Lasso Regression
  3.5 Parametric Survival Analysis
  3.6 Different Distribution
    3.6.1 Exponential distribution
    3.6.2 Weibull model
    3.6.3 Gamma distribution
  3.7 Maximum Likelihood Estimation
  3.8 Illustration Using R

4 Competing Risk Modeling in High Dimensional Data
  4.1 Introduction
  4.2 Survival and Competing Risk Model
  4.3 The Competing Risk Models
  4.4 Aalen’s Additive Hazards Model
  4.5 Bayesian Formulation
  4.6 The Lasso Method
  4.7 Metropolis Algorithm
  4.8 Deviance Information Criterion and Akaike Information Criteria
  4.9 Illustration with Example Data
  4.10 Bayesian for Competing Risk Analysis Illustration Using R

5 Biomarker Thresholding in High Dimensional Data
  5.1 Introduction
  5.2 Statistical Methodology for Biomarker Thresholding
  5.3 Thresholding for Repeatedly Measured Biomarker
  5.4 Statistical Model
  5.5 Repeatedly Measured Biomarker Thresholding
  5.6 Biomarker Thresholding Determination
  5.7 Illustration Using R
  5.8 Data Illustration
  5.9 Classification and Regression Tree Analysis in Biomarker Thresholding

6 High Dimensional Survival Data Analysis
  6.1 Introduction
  6.2 Challenges in High Dimensional Data
  6.3 Variable Selection in High Dimensional Data
    6.3.1 Lasso selection
    6.3.2 Elastic net
    6.3.3 Cox regression
  6.4 Survival and High Dimensional Data
  6.5 Covariance Structure in High Dimensional Data
  6.6 Variable Selection
    6.6.1 Bayesian information criterion
    6.6.2 Deviance information criterion
    6.6.3 Predictive criteria
  6.7 Illustration Using R
    6.7.1 Data filtration with batches