Comparative Approaches
to Using R and Python for
Statistical Data Analysis
Rui Sarmento
University of Porto, Portugal
Vera Costa
University of Porto, Portugal
A volume in the Advances in
Systems Analysis, Software
Engineering, and High
Performance Computing
(ASASEHPC) Book Series
Published in the United States of America by
IGI Global
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com
Copyright © 2017 by IGI Global. All rights reserved. No part of this publication may be
reproduced, stored or distributed in any form or by any means, electronic or mechanical, including
photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the
names of the products or companies does not indicate a claim of ownership by IGI Global of the
trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Names: Sarmento, Rui, 1979- | Costa, Vera, 1983-
Title: Comparative approaches to using R and Python for statistical data
analysis / by Rui Sarmento and Vera Costa.
Description: Hershey PA : Information Science Reference, [2017] | Includes
bibliographical references and index.
Identifiers: LCCN 2016050989| ISBN 9781683180166 (hardcover) | ISBN
9781522519898 (ebook)
Subjects: LCSH: Mathematical statistics--Data processing. | R (Computer
program language) | Python (Computer program language)
Classification: LCC QA276.45.R3 S27 2017 | DDC 519.50285/5133--dc23 LC record available at
https://lccn.loc.gov/2016050989
This book is published in the IGI Global book series Advances in Systems Analysis, Software
Engineering, and High Performance Computing (ASASEHPC) (ISSN: 2327-3453; eISSN: 2327-3461)
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in
this book are those of the authors, but not necessarily of the publisher.
Advances in Systems
Analysis, Software
Engineering, and High
Performance Computing
(ASASEHPC) Book Series
ISSN:2327-3453
EISSN:2327-3461
Editor-in-Chief: Vijayan Sugumaran, Oakland University, USA
Mission
The theory and practice of computing applications and distributed systems has emerged
as one of the key areas of research driving innovations in business, engineering, and
science. The fields of software engineering, systems analysis, and high performance
computing offer a wide range of applications and solutions in solving computational
problems for any modern organization.
The Advances in Systems Analysis, Software Engineering, and High
Performance Computing (ASASEHPC) Book Series brings together research
in the areas of distributed computing, systems and software engineering, high
performance computing, and service science. This collection of publications is
useful for academics, researchers, and practitioners seeking the latest practices and
knowledge in this field.
Coverage
• Performance Modelling
• Computer System Analysis
• Computer Networking
• Engineering Environments
• Human-Computer Interaction
• Metadata and Semantic Web
• Software Engineering
• Distributed Cloud Computing
• Enterprise Information Systems
• Virtual Data Systems
IGI Global is currently accepting manuscripts for publication within this
series. To submit a proposal for a volume in this series, please contact our Acquisition
Editors at Acquisitions@igi-global.com or visit: http://www.igi-global.com/publish/.
The Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book
Series (ISSN 2327-3453) is published by IGI Global, 701 E. Chocolate Avenue, Hershey, PA 17033-1240, USA, www.igi-global.com.
This series is composed of titles available for purchase individually; each title is edited to be contextually
exclusive from any other title within the series. For pricing and ordering information please visit
http://www.igi-global.com/book-series/advances-systems-analysis-software-engineering/73689. Postmaster: Send all address changes to above
address. Copyright © 2017 IGI Global. All rights, including translation in other languages reserved by the publisher. No
part of this series may be reproduced or used in any form or by any means – graphics, electronic, or mechanical, including
photocopying, recording, taping, or information and retrieval systems – without written permission from the publisher,
except for non commercial, educational use, including classroom teaching purposes. The views expressed in this series
are those of the authors, but not necessarily of IGI Global.
Titles in this Series
For a list of additional titles in this series, please visit:
http://www.igi-global.com/book-series/advances-systems-analysis-software-engineering/73689
Resource Management and Efficiency in Cloud Computing Environments
Ashok Kumar Turuk (National Institute of Technology Rourkela, India) Bibhudatta Sahoo (National
Institute of Technology Rourkela, India) and Sourav Kanti Addya (National Institute of
Technology Rourkela, India)
Information Science Reference • ©2017 • 352pp • H/C (ISBN: 9781522517214) • US $205.00
Handbook of Research on End-to-End Cloud Computing Architecture Design
Jianwen “Wendy” Chen (IBM, Australia) Yan Zhang (Western Sydney University, Australia)
and Ron Gottschalk (IBM, Australia)
Information Science Reference • ©2017 • 507pp • H/C (ISBN: 9781522507598) • US $325.00
Innovative Research and Applications in Next-Generation High Performance Computing
Qusay F. Hassan (Mansoura University, Egypt)
Information Science Reference • ©2016 • 488pp • H/C (ISBN: 9781522502876) • US $205.00
Developing Interoperable and Federated Cloud Architecture
Gabor Kecskemeti (University of Miskolc, Hungary) Attila Kertesz (University of Szeged,
Hungary) and Zsolt Nemeth (MTA SZTAKI, Hungary)
Information Science Reference • ©2016 • 398pp • H/C (ISBN: 9781522501534) • US $210.00
Managing Big Data in Cloud Computing Environments
Zongmin Ma (Nanjing University of Aeronautics and Astronautics, China)
Information Science Reference • ©2016 • 314pp • H/C (ISBN: 9781466698345) • US $195.00
Emerging Innovations in Agile Software Development
Imran Ghani (Universiti Teknologi Malaysia, Malaysia) Dayang Norhayati Abang Jawawi (Universiti
Teknologi Malaysia, Malaysia) Siva Dorairaj (Software Education, New Zealand) and
Ahmed Sidky (ICAgile, USA)
Information Science Reference • ©2016 • 323pp • H/C (ISBN: 9781466698581) • US $205.00
For an entire list of titles in this series, please visit:
http://www.igi-global.com/book-series/advances-systems-analysis-software-engineering/73689
To our parents and family…
Table of Contents
Preface
Introduction
Chapter 1: Statistics
Chapter 2: Introduction to Programming R and Python Languages
Chapter 3: Dataset
Chapter 4: Descriptive Analysis
Chapter 5: Statistical Inference
Chapter 6: Introduction to Linear Regression
Chapter 7: Factor Analysis
Chapter 8: Clusters
Chapter 9: Discussion and Conclusion
About the Authors
Index
Preface
We may at once admit that any inference from the particular to the general
must be attended with some degree of uncertainty, but this is not the same
as to admit that such inference cannot be absolutely rigorous, for the nature
and degree of the uncertainty may itself be capable of rigorous expression.
– Sir Ronald Fisher
The importance of Statistics in our world has increased greatly in recent decades.
Because it makes it possible to draw inferences from data samples, statistics is one of
the greatest achievements of humanity. Its use has spread to a wide range of
research areas and is no longer limited to research done by mathematicians or
professional statisticians. Nowadays, it is standard procedure to include some
statistical analysis whenever a scientific study involves data. Statistical analysis
is in high demand, and highly influential, in today’s Medicine, Biology,
Psychology, Physics and many other areas.
The demand for statistical analysis of data has grown so much that it has
even survived attacks from the mathematically challenged.
If the statistics are boring, then you’ve got the wrong numbers. – Edward
R. Tufte
With the advent of computers and advanced computer software, analysis
software has become far more intuitive in recent years and has opened up to a
wider audience of users. It is now common to see another kind
of statistical researcher in modern academia: those with no advanced studies
in mathematics are the new statisticians, and they use and produce
statistical studies with little or no help from others.
Above all else show the data. – Edward R. Tufte
The need to present studies clearly to a non-specialized
audience has driven the development not only of intuitive software but also of
software dedicated to the visualization of data and data analysis. For example,
a psychologist with no mathematical background can now choose from
several languages and software packages to add value to their studies by performing
a thorough analysis of their data and presenting it in an understandable fashion.
This book presents a comparison of two of the languages available for
data analysis and statistical analysis: R and Python.
It is directed to anyone, experienced or not, who might need to
analyze their data in an understandable way. For those more experienced,
the authors of this book approach the theoretical fundamentals of statistics,
and for a broader audience, they explain the programming fundamentals
of both R and Python.
The statistical tasks covered begin with Descriptive Analytics. The authors describe
the need for basic statistical metrics and present the main procedures in
both languages. Inferential Statistics are presented next, with particular
emphasis on the statistical tests most needed to perform a coherent
data analysis. Following Inferential Statistics, the authors provide examples,
in both languages, as part of a thorough explanation of Factor Analysis.
The authors emphasize the importance of studying variables and not only the
objects. Nonetheless, the authors also dedicate a chapter to
the clustering analysis of the studied objects. Finally, an introductory study of
regression models and linear regression is also included in this book.
The authors do not deny that the structure of the book might raise some
questions of comparison, since the book deals with two different programming
languages. The authors end the book with a discussion that provides some
clarification on this subject and, above all, offers some insights for
further consideration.
Finally, the authors would like to thank all the colleagues who provided
suggestions and reviewed the manuscript throughout its development, and
all the friends and family members for their support.