Data Science Desktop Survival Guide
Graham Williams
Togaware
[email protected]
2021-06-24

Preface

20210308

“The enjoyment of one’s tools is an essential ingredient of successful work.” Donald E. Knuth

This book is undergoing conversion from LaTeX to Bookdown. It is a work in progress and there remain numerous glitches. Please bear with us.

Welcome to the world of Artificial Intelligence (AI), Machine Learning, and Data Science. Data Science has come to describe many different activities that are driven by data, utilising technological advances in AI and Machine Learning, amongst others. Indeed, we can think of Data Science as an ecosystem of both technology and experiences, shared through the freedom of open source software that implements tools for AI and Machine Learning. Open source software gives us the freedom to do as we want with the software, with very few, and ideally no, restrictions, except the requirement to maintain our freedoms.

The aim of this book is to gently guide the novice along the pathway to Data Science, from data processing through Machine Learning and to AI. I hope I can share the excitement of a fun and productive environment for exploring data with a focus on the R language as our platform of choice. R remains the most flexible and powerful language developed specifically for the analysis of data.

This book provides a guide to the many different regions of the R platform, with a focus on doing what is required of the Data Scientist. It is comprehensive, beginning with basic support for the novice Data Scientist and moving into recipes for the variety of analyses we may find ourselves needing. On completing an install of R (which may take only a few minutes) you are ready to explore your data and beyond.

The many different types of data analysis are covered in this book, including basic data ingestion, data cleaning and wrangling, data visualisation, and modelling and evaluation, in order to discover new knowledge from our data. Tools for developers of systems to be deployed are well represented.

Your donation will support ongoing development and give you access to the PDF version of this book. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2021 [email protected] Creative Commons Attribution-ShareAlike 4.0.

About this Book

20210314

This book has been a work in progress since 1995, just as Data Science continues to develop and expand into our lives. Every section will (eventually) begin with a date that indicates the currency of the section, that is, when it was last reviewed or updated.

Since beginning the survival guide books in 1995 they have grown in all kinds of directions. My original aim was to capture useful notes for the many and varied common tasks we find ourselves doing as data scientists (or data miners back then). I structured the book as one-page nuggets of information. That is, each section within a chapter targeted a single printed page, and focused on a single task. This was the origin of my OnePageR Desktop Survival Guides.
It seems to have worked well over the years, judging from my personal use and your feedback. This material has also led to the publication of two books.

Readers are invited to send corrections, comments, suggestions, and updates to me at [email protected]. Your feedback is most welcome and will be acknowledged within the book.

A PDF version of this book is available for a small financial donation, which goes towards supporting the development and availability of the book. Please visit Togaware for the details. The HTML version contains the same material and remains freely available from Togaware.

Production

This book is produced using bookdown. Emacs is used to edit the text. Many will be using RStudio to edit their bookdown documents, which is a generally more friendly environment and is the environment of choice for bookdown support. I’ve used Emacs since 1985 and, as a fully extensible “kitchen-sink” type of editor, it has served me well for over 35 years, despite numerous flirtations with “better” editors over my career. RStudio and Visual Studio Code come close.

Bookdown is an rmarkdown-based platform for intermixing text with executable code (like Python, R and Shell code blocks). Rmarkdown itself utilises the simple markdown syntax to mark up the sections of a document. Running knitr over the rmarkdown material produces a markdown document. Pandoc is then utilised to produce HTML, which is published on the Web. For the PDF output pandoc utilises LaTeX, converting the markdown into LaTeX markup, with xetex then used to convert that to PDF. All these tools are open source software and available on multiple platforms.

Many books are today being written using bookdown. Examples include Data Science at the Command Line (github) and Efficient R Programming (github).

What’s In A Name

GNU/Linux refers to the GNU environment and the GNU and other applications running in that environment on top of the Linux operating system kernel.
Ubuntu and its underlying base distribution Debian are complete repository-based distributions which include many applications pre-built for the particular choice of operating system kernel. The repositories house pre-built packages ready to be installed.

X Window System is the common windowing system used in Ubuntu and is a separate, complementary component to the operating system itself.

Microsoft Windows (or MS/Windows and, less informatively, just Windows) usually refers to the whole of the popular operating system, from kernel to applications, irrespective of which version of Microsoft Windows is being run, unless the version is important. Microsoft Windows is one of many windowing systems and came on to the screen rather later than the pioneering Apple Macintosh windowing system and the Unix windowing systems. We will refer to MS/Windows version 10 as the last release of this Microsoft operating system, which going forward has snapshot releases rather than new versions.

Organisation of the Book

20200218

Each chapter aims to be a standalone reference, using a one-pager style whereby each page is a standalone guide to a specific point or topic. This one-page concept comes from my OnePageR book for data science, artificial intelligence, and machine learning and has been quite successful there.

So sit back and enjoy the freedom and liberty that comes with free software.
Also, please consider becoming an active part of the community that is making computers, the applications they run, and the data they collect a benefit to society worldwide, rather than a privilege of the few.

Acknowledgements

20210223

There are many people to thank for sharing these tools, their knowledge, and their encouragement in many different ways. Indeed, the open source and especially the R and now the tidyverse communities are characterised by their willingness to share for the good of us all, and many folk have also contributed directly and indirectly to this book through their sharing. Their contributions are acknowledged throughout the book, but there are always gaps. To all who share openly, thank you. I have learned so much from this community over more than 30 years.

Your support for maintenance of this book is always welcome. Financial support is used to contribute toward the costs of running the servers used to make this book available. Donations can be made through PayPal at https://onepager.togaware.com/.

The following are acknowledged for their recent support of the book: David Montgomery, Arivarignan Gunaseelan, Clemens Kielhauser, and Ricardo Scotta.

Waiver and Copyright

20210314

Writing of this book began in 1995 and it continues to be updated to this day. It is copyright by Togaware and licensed under the Creative Commons Attribution-ShareAlike 4.0 license. It is made freely available in the hope that it serves as a useful resource for users of Free and Open Source Software.

The procedures and applications presented in this book have been included for their instructional value. They have been tested at various times over the years but are not guaranteed for any particular purpose. We also note that the functionality of different packages can change over time and, whilst we make an effort to update the material, the sheer volume presents a challenge.

The publisher, togaware.com, does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs and applications.

Copyright © Togaware Pty Ltd

1 Data Science

20200104

Science is analytic description, philosophy is synthetic interpretation. Science wishes to resolve the whole into parts, the organism into organs, the obscure into the known. (Durant 1926)

Today we live in a data rich, information driven, knowledge strained, and wisdom scant world. Data surrounds us everywhere we look. Data describes every facet of everything we know and do.
Today we are capturing and storing more data electronically (i.e., digitising the data) at a rate we humans have never before been capable of. We have so much data available, even more yet to be digitised from the world around us, and so much of the digitised data yet to be analysed. Most of our work today is about data, and perhaps it always has been.

Data science is a broad church capturing the endeavour of analysing data and information by appropriately applying an ever changing and vast collection of techniques and technology to deliver knowledge to be synthesised into wisdom.

The role of a data scientist is to perform the transformations that make sense of it all in an evidence-based endeavour delivering the knowledge deployed with wisdom. The data scientist acts with humanity and philosophy to synthesise knowledge into wisdom whereby we strive to resolve the obscure into the known. It is this synthesis that delivers the real benefit of the science, whether that benefit be for business, industry, government, or environment, but always for humanity.

A data scientist brings together a suite of skills to solve problems in a data driven way. It is not the only way to solve problems, and indeed may not even be the best way to solve problems, but it is the current best way to deal with many real world use cases today. In the future we will see more focus on knowledge representation and causal reasoning, but for now we can do a lot with data.

References

Durant, Will. 1926. The Story of Philosophy. 2012 ed. Simon & Schuster.