Comparative Approaches to Using R and Python for Statistical Data Analysis

Rui Sarmento
University of Porto, Portugal

Vera Costa
University of Porto, Portugal

A volume in the Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book Series

Published in the United States of America by
IGI Global
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2017 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data
Names: Sarmento, Rui, 1979- | Costa, Vera, 1983-
Title: Comparative approaches to using R and Python for statistical data analysis / by Rui Sarmento and Vera Costa.
Description: Hershey PA : Information Science Reference, [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2016050989 | ISBN 9781683180166 (hardcover) | ISBN 9781522519898 (ebook)
Subjects: LCSH: Mathematical statistics--Data processing. | R (Computer program language) | Python (Computer program language)
Classification: LCC QA276.45.R3 S27 2017 | DDC 519.50285/5133--dc23
LC record available at https://lccn.loc.gov/2016050989

This book is published in the IGI Global book series Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) (ISSN: 2327-3453; eISSN: 2327-3461)

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.

Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book Series
ISSN: 2327-3453
EISSN: 2327-3461
Editor-in-Chief: Vijayan Sugumaran, Oakland University, USA

Mission
The theory and practice of computing applications and distributed systems has emerged as one of the key areas of research driving innovations in business, engineering, and science. The fields of software engineering, systems analysis, and high performance computing offer a wide range of applications and solutions in solving computational problems for any modern organization. The Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book Series brings together research in the areas of distributed computing, systems and software engineering, high performance computing, and service science. This collection of publications is useful for academics, researchers, and practitioners seeking the latest practices and knowledge in this field.

Coverage
• Performance Modelling
• Computer System Analysis
• Computer Networking
• Engineering Environments
• Human-Computer Interaction
• Metadata and Semantic Web
• Software Engineering
• Distributed Cloud Computing
• Enterprise Information Systems
• Virtual Data Systems

IGI Global is currently accepting manuscripts for publication within this series. To submit a proposal for a volume in this series, please contact our Acquisition Editors at [email protected] or visit: http://www.igi-global.com/publish/.

The Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book Series (ISSN 2327-3453) is published by IGI Global, 701 E. Chocolate Avenue, Hershey, PA 17033-1240, USA, www.igi-global.com.
This series is composed of titles available for purchase individually; each title is edited to be contextually exclusive from any other title within the series. For pricing and ordering information please visit http://www.igi-global.com/book-series/advances-systems-analysis-software-engineering/73689. Postmaster: Send all address changes to above address. Copyright © 2017 IGI Global. All rights, including translation in other languages reserved by the publisher. No part of this series may be reproduced or used in any form or by any means – graphics, electronic, or mechanical, including photocopying, recording, taping, or information and retrieval systems – without written permission from the publisher, except for non-commercial, educational use, including classroom teaching purposes. The views expressed in this series are those of the authors, but not necessarily of IGI Global.

Titles in this Series
For a list of additional titles in this series, please visit: http://www.igi-global.com/book-series/advances-systems-analysis-software-engineering/73689

Resource Management and Efficiency in Cloud Computing Environments
Ashok Kumar Turuk (National Institute of Technology Rourkela, India), Bibhudatta Sahoo (National Institute of Technology Rourkela, India) and Sourav Kanti Addya (National Institute of Technology Rourkela, India)
Information Science Reference • ©2017 • 352pp • H/C (ISBN: 9781522517214) • US $205.00

Handbook of Research on End-to-End Cloud Computing Architecture Design
Jianwen “Wendy” Chen (IBM, Australia), Yan Zhang (Western Sydney University, Australia) and Ron Gottschalk (IBM, Australia)
Information Science Reference • ©2017 • 507pp • H/C (ISBN: 9781522507598) • US $325.00

Innovative Research and Applications in Next-Generation High Performance Computing
Qusay F. Hassan (Mansoura University, Egypt)
Information Science Reference • ©2016 • 488pp • H/C (ISBN: 9781522502876) • US $205.00

Developing Interoperable and Federated Cloud Architecture
Gabor Kecskemeti (University of Miskolc, Hungary), Attila Kertesz (University of Szeged, Hungary) and Zsolt Nemeth (MTA SZTAKI, Hungary)
Information Science Reference • ©2016 • 398pp • H/C (ISBN: 9781522501534) • US $210.00

Managing Big Data in Cloud Computing Environments
Zongmin Ma (Nanjing University of Aeronautics and Astronautics, China)
Information Science Reference • ©2016 • 314pp • H/C (ISBN: 9781466698345) • US $195.00

Emerging Innovations in Agile Software Development
Imran Ghani (Universiti Teknologi Malaysia, Malaysia), Dayang Norhayati Abang Jawawi (Universiti Teknologi Malaysia, Malaysia), Siva Dorairaj (Software Education, New Zealand) and Ahmed Sidky (ICAgile, USA)
Information Science Reference • ©2016 • 323pp • H/C (ISBN: 9781466698581) • US $205.00

To our parents and family…

Table of Contents

Preface

Introduction

Chapter 1
Statistics

Chapter 2
Introduction to Programming R and Python Languages

Chapter 3
Dataset

Chapter 4
Descriptive Analysis

Chapter 5
Statistical Inference

Chapter 6
Introduction to Linear Regression

Chapter 7
Factor Analysis

Chapter 8
Clusters

Chapter 9
Discussion and Conclusion

About the Authors

Index

Preface

We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression.
– Sir Ronald Fisher

The importance of statistics in our world has increased greatly in recent decades. Due to the need to draw inferences from data samples, statistics is one of the greatest achievements of humanity. Its use has spread to a wide range of research areas and is no longer limited to research done by mathematicians or pure statistics professionals. Nowadays, it is standard procedure to include some statistical analysis when a scientific study involves data. There is strong demand for statistical analysis in today’s Medicine, Biology, Psychology, Physics and many other areas.
The demand for statistical analysis of data has grown so much that it has even survived attacks from the mathematically challenged.

If the statistics are boring, then you’ve got the wrong numbers.
– Edward R. Tufte

With the advent of computers and advanced software, analysis tools have become far more intuitive in recent years and have opened up to a wider audience of users. It is now common to see a new kind of statistical researcher in modern academia. Those with no advanced studies in mathematical areas are the new statisticians, who use and produce statistical studies with little or no help from others.

Above all else show the data.
– Edward R. Tufte

The need to present studies clearly to a non-specialized audience has driven the development not only of intuitive software but also of software directed at the visualization of data and data analysis. For example, a psychologist with no mathematical background can now choose from several languages and software packages to add value to their studies by performing a thorough analysis of their data and presenting it in an understandable fashion.

This book presents a comparison of two of the languages available for data analysis and statistical analysis: R and Python. It is directed at anyone, experienced or not, who might need to analyze their data in an understandable way. For those more experienced, the authors approach the theoretical fundamentals of statistics and, for a broader audience, explain the programming fundamentals of both R and Python.

The statistical tasks begin with Descriptive Analytics: the authors describe the need for basic statistical metrics and present the main procedures in both languages. Then, Inferential Statistics is presented. Great importance is given to the statistical tests most needed to perform a coherent data analysis.
Following Inferential Statistics, the authors also provide examples, in both languages, as part of a thorough explanation of Factor Analysis. The authors emphasize the importance of studying the variables and not only the objects. Nonetheless, the authors also dedicate a chapter to the clustering analysis of the studied objects. Finally, an introductory study of regression models and linear regression is also presented in this book.

The authors do not deny that the structure of the book might raise some comparison questions, since the book deals with two different programming languages. The authors end the book with a discussion that provides some clarification on this subject and, above all, some insights for further consideration.

Lastly, the authors would like to thank all the colleagues who provided suggestions and reviewed the manuscript in all its development phases, and all the friends and family members for their support.