
Introduction to R & R Studio - University of Adelaide PDF

37 Pages·2014·0.39 MB·English


Introduction to R & R Studio

Steve Pederson
Bioinformatics Hub, Level 4, Santos Petroleum Engineering Building,
University of Adelaide, Adelaide, South Australia 5005

November 7, 2014

Chapter 1 Introduction

1.1 Welcome

Thank you for your attendance & welcome to the Introduction to R & R Studio Workshop. This is a free offering by the University of Adelaide Bioinformatics Hub, which is a centrally funded initiative from the Department of Vice-Chancellor (Research), with the aim of assisting & enabling researchers in their work. Training workshops & seminars such as this one are an important part of this initiative.

The Bioinformatics Hub itself has a web-page at http://www.adelaide.edu.au/bioinformatics-hub/, and to be kept up to date on upcoming events and workshops, please join the internal Bioinformatics mailing list on http://list.adelaide.edu.au/mailman/listinfo/bioinfo.

Today’s workshop has been prepared with generous technical support & advice provided by Dr Jono Tuke (School Of Mathematical Sciences), Dr Dan Kortschak (Adelson Research Group), Klay Saunders (Undergraduate Placement, Bioinformatics Hub) & Associate Professor Gary Glonek (School Of Mathematical Sciences). The tutors today are Steve Pederson (Bioinformatics Hub) & David Price (School Of Mathematical Sciences).

1.2 Course Summary

In today’s workshop we will be introducing you to some of the basic concepts which will enable you to utilise the Statistical Software Environment R. The emphasis will be on data types & key concepts, but with an additional focus on data exploration through plotting. As most people will come from a biological background today, there will be minimal statistics, however if you have any questions in this area feel free to ask the tutors.

To help you with the material, there are a few screencasts on YouTube recorded by Jono Tuke from the School of Mathematical Sciences.
Links will be given at the appropriate sections of the text if you’d like to view them before stepping through the sections on your own machine. The screencasts are recorded in 1080 HD so if they are poor resolution, alter YouTube to the higher setting.

Learning R is also just like every other piece of software you use. The more you use it & explore it, the more proficient you become. The number of uses for R is mind-boggling and even those of us with years of experience have no idea about some of these uses. Take as much or as little as you’d like out of today’s session. There is a lot of information, and instead of focussing on one particular type of analysis, we’ve tried to keep it more general. There is no rush to get through the material, and please take your time to explore everything you’d like to. Ask as many questions as you’d like as we are here to help. Questions for you to answer throughout the session are given in separate boxes.

At the end of this workshop you should be able to:

• Open RStudio.
• Load data into R.
• Understand the different types of variables.
• Get some basic plots and summary statistics.
• Save your commands for later use.

We hope today’s session will be useful in enabling you to continue and to advance your research.

1.3 R Studio

Today’s session will be entirely within the workspace provided by R Studio. This type of interface is commonly known as an Integrated Development Environment or IDE, and R Studio has become a very powerful tool for those of us working in R. In short, it provides an easy & straightforward way to access all of the information & objects that you need. The look & layout of R Studio is also virtually identical regardless of whether you are operating within Windows, Linux or Mac.

If you do not have either R or R Studio installed, please go to http://cran.r-project.org/ to download the appropriate version of R for your computer, and http://www.rstudio.com/ for R Studio.
These are relatively large installs so they may take a few minutes to download & install. Both should still run smoothly even on older computers.

Once installed, open R Studio then select

    File > New > R script

This should give you a new window to enter commands into which looks like Figure 1.1.

Figure 1.1: The basic R Studio layout

1.3.1 The R Studio Layout

Jono’s YouTube Tutorials: http://youtu.be/8gnB138nUnE

As explained in the above video, in the R Studio layout you will see four regions. Whilst there may be a slight discrepancy here or there depending on which version of R Studio you have installed, the core ideas will remain the same.

1. Script Window. This is where we enter our code, which is usually run either in chunks, or top to bottom. Most of what you type will be written in here, and we save these collections of code with the suffix .R to let us & the operating system know that this is code for R. For those not familiar with programming lingo, a continuous set of instructions written as code is what we refer to as a script. As well as code, you can also prefix lines in this window with a hash (i.e. #). These lines then become a ‘comment’ which is not executed, but can be extremely helpful for explaining what your code is about to do. This is a vital part of coding when starting out as you can write notes to your future self who will often be looking back at it at some point in the future, trying to figure out what the heck your less-experienced current self was trying to do.

2. Console. Once we have entered code in the Script Window we send it to the Console where it is executed. We can also enter code directly here but by using the Script Window above, we will have a record of our code that we can save & refer back to in the future.

3. Workspace. You will notice two tabs in this window. One labelled Environment or Workspace, with the other tab labelled History.
The Environment/Workspace tab contains a list of all the objects & variables contained in the current R environment. This is essentially like a desktop or workspace which will contain all of the data files & analytic outputs from your analysis. At the start of the session this will be empty as we haven’t done anything yet, but as we begin to create & load data objects, this window will become populated with the names of all of these objects. The History tab will contain a list of every command that has been executed in the Console. This can be a very handy backup if you’ve been lazy & just entered something into the Console without first writing it into the Script Window, but then you can’t remember what you’ve done.

4. Files. The lower right window contains 5 more tabs & these are:
   1) Files - The file & directory structure of your current working directory,
   2) Plots - Where any plots you generate are displayed,
   3) Packages - The packages you have installed, with a tick next to those currently loaded. (We’ll explain this more later),
   4) Help - Where you can find the help pages for every function installed in your version of R, &
   5) Viewer - Used for displaying locally generated .html content. We won’t really need to know much about this tab today.

1.3.2 Using the Console For Basic Calculations

Jono’s YouTube Tutorials: http://youtu.be/yqD6vD6-lLo

Before we delve too far into the session, we should have a quick exploration of the Console. In essence, this can be used as a calculator and also has a series of inbuilt functions. In the following section and for the remainder of this document, note that any comment lines are prefixed with a hash symbol (#). You don’t need to enter these lines, they are just there to explain what you are doing. Note also that the code for you to type is in the shaded regions, whilst the results (which you won’t need to type) are in black. R usually outputs any results prefixed by a number in square brackets, e.g. [1].
First try some simple addition like 2 + 2

    2 + 2
    [1] 4

We can also perform subtraction, multiplication and division

    2 - 2
    [1] 0

    2 * 2
    [1] 4

    2 / 2
    [1] 1

To take the power of something, we use the caret symbol (^)

    2 ^ 3
    [1] 8

There are also a series of inbuilt functions in the R console, so let’s have a look at some familiar ones to begin with. To find the square-root of a value, we can use the function sqrt() or we could take it to the power of 0.5. Note that R is a case-sensitive language so sqrt is different to SQRT.

    sqrt(4)
    [1] 2

    # Or alternatively
    4 ^ 0.5
    [1] 2

Note that in the above, we passed the value 4 to the function sqrt by enclosing it in round brackets. This is how we pass data to a function in R. Spaces are typically not meaningful in R but the convention is to place these brackets directly after the function name, unlike in some command-line based functions. We also use square brackets [] (covered in the next chapter) and curly brackets {}, with the latter’s usage being beyond the scope of today’s session.

Similarly, there is an inbuilt function for finding the logarithm of a number. The default is to find the natural logarithm (i.e. ln), but we can change this by specifying the base as in the second example.

    log(8)
    [1] 2.079442

    # Now try using base 2
    log(8, 2)
    [1] 3

1.3.3 Accessing Help

How did we know to put the base after the comma in the above example? There are multiple ways to get help in R Studio, with some of them being more useful than others. R help pages have a reputation for being difficult to understand for new users, so hopefully we can help you understand what they mean. As this is open-source & free software, the quality of the help pages will vary from extremely useful to downright uninformative, but they will all follow the same format. In the Console type

    ?log

Notice that in the lower right window, the “Help” tab became active and you have found the relevant help page for Logarithms & Exponentials.
Under the Description section is a quick description in words of the functions covered in this help page. Next we have Usage which shows what arguments to provide to the function. Any default values are shown here. In the case of the log function you will see log(x, base = exp(1)). The value x refers to the variable that you give the function, such as the value 8 above. After the comma is the default value for the base for which the logarithm is calculated. If we provide a value after a comma, that will be the base used by the function, but if we leave it blank with no comma the value is found using base e. (The expression exp(1) is an alternative way of writing the mathematical constant e ≈ 2.7182818.)

The Arguments section explains what values go in each of the positions within a function, i.e. these are the function arguments. Some of the remaining sections are a bit too advanced for what we need to know at this stage, but the section Value can be important as it gives us an idea of what the output looks like. We’ll talk about vectors in a minute so don’t worry about this section for now, unless you have some programming experience already.

Now that we know the structure of the log function we know that it can take either 1 or 2 arguments. Go back to the Console and type the word log but instead of hitting <Enter>, hit the <tab> key. This performs a search of all the available functions which begin with the three letters “log”. If you can’t remember the name of a function this can be very helpful. For example, you might guess that the function for finding a square root would begin with “sq” but you mightn’t be sure of the rest. Typing “sq<tab>” would help you find the correct function.

If you know the function name, but can’t remember the correct arguments you can enter the name of the function, followed by a bracket “(” and the tab key.
Try entering “log(<tab>” and you will see the two arguments x and base appear in a pop-up window along with part of the description from the Arguments section of the complete Help page.

1.3.4 Where Do These Functions Come From?

Fortunately two functions don’t come together, find each other attractive & create more little baby functions. That would lead to the Terminator becoming real, which is a world we should avoid. Instead, go back to the help page for the function log and notice that at the very top of the page you can see log {base}. This means that this function is included in the package called base, which is always loaded. Now enter:

    ?median

Notice that at the top of the page it now says median {stats} which tells us that this function isn’t part of the base package, but comes from the package stats. Head to the Packages tab in this window & scroll down. This is a list of all the packages which you downloaded when you installed R & each package is really just a collection of these types of functions. Notice that there is a tick next to the package stats, which lets you know that this package has been loaded & all the functions it contains are available to you.

In the Console, type med<tab> and you’ll see all the available functions that start with these three letters. Now go to the Packages tab again and untick the stats package. Repeat the search using med<tab> and you’ll notice that these functions no longer appear. These are pretty important functions, so we can reload the package either by clicking where the tick was, or by going back to the Console and entering:

    library(stats)

For every analysis we perform, we’ll require the appropriate functions and as we become more experienced we’ll know which package they are located in. The above syntax, i.e. library(package.name), is how we load the library of functions that each package contains into the current environment for us to be able to use.
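As an optional aside, the same information shown at the top of a help page can also be checked directly from the Console. The sketch below is only an illustration and assumes a fresh R session with the default packages attached; the functions find() and search() are standard R functions that are not otherwise used in this workshop, and the object names log_home, median_home and stats_attached are made up for this example.

```r
# find() reports where on the search path an object with this name lives
log_home <- find("log")
median_home <- find("median")
print(log_home)
print(median_home)

# search() lists everything currently attached; library() adds to this list
library(stats)
stats_attached <- "package:stats" %in% search()
print(stats_attached)
```

In a fresh session the first two calls should report package:base and package:stats, matching the log {base} and median {stats} labels on the help pages.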
If we’re not sure where to find a specific function we can also find out this information by using a double question mark in the Console. Let’s try it using the function median and you’ll notice that in the Help tab, every function that matches the string “median” somewhere in its help page will be returned.

    ??median

1.3.5 Moving Between Windows

Although we can easily move from each window in R Studio to another by clicking on it, we will inevitably be in the wrong place & start typing something. Click on View in the Drop-Down Menu at the top of the screen, and you will see the shortcuts for moving between windows. Ctrl+3 will take you to the Help page, whilst importantly Ctrl+2 will shift the flashing cursor (i.e. focus) to the Console. This menu refers to the Script Window as “Source”, so Ctrl+1 will shift focus to this window. If you’re about to type data in a specific place, it can be a good habit to start any entries in the Console with Ctrl+2. It’s very easy to accidentally overwrite code in the Script Window if you’re not paying attention, so this is an excellent habit to develop early.

1.3.6 History

It may also be worth noting at this point that everything we have done is displayed in the “History” tab of the upper-right window. This will continue to record every entry we make in the Console for the entire session. We can also access these entries by using the down & up arrows in the Console window. Switch focus to the Console by using Ctrl+2 and use the up arrow to scroll through your previous commands. Notice that these match the History exactly, and this can be a very useful feature for correcting an instruction which failed due to a single typo. We can also move within a line on the Console by using the sideways arrows, and the <Esc> key will delete any code currently entered at the prompt.

Chapter 2 Vectors

Jono’s YouTube Tutorials: http://youtu.be/53kj44XUKzQ

Now that we know where to look for help, we should turn our attention to how we view data in R.
Most of us are probably used to Excel where we can see all of the data by just scrolling down the spreadsheet. Unfortunately Excel has strict limits on the size of your spreadsheet, whereas in R the only bounds are defined by the memory of your computer, and even then many packages have developed ways of enabling you to access datasets with millions of data points even on very limited computers. This is one of the many advantages of R, in that we can handle very large datasets with relative ease.

2.1 What is a Vector?

In R we often have data contained in vectors, and for those who start to get nervous around mathematics these are just a collection of values. More formally, they are objects in the Environment (or workspace) which are one-dimensional and all of the elements of the vector will be of the same data type. We can also think of them as being analogous to a single column of data in an Excel spreadsheet. A simple vector is the numbers 1 to 10, and we can obtain this vector just by going to the Console (Ctrl+2) & typing:

    1:10
     [1]  1  2  3  4  5  6  7  8  9 10

Let’s get a bit more serious now and move to the Script Window by entering Ctrl+1. Type the text

    x <- 1:10

then hit Ctrl+Enter. This will send the current line from the Script Window to the Console. We have just created our first vector in the Environment! The symbol “<-” is the most common way of sending values to an object in R, and you can almost imagine it as being an arrow with the values 1:10 being sent to x. We could have also entered x = 1:10, but the first way is preferable due to its flexibility, and as it also avoids any confusion with logical tests (covered later). The deeper reasons for this are a bit too complex for today’s session and once you get used to this way of doing things, it will become quite intuitive.

Note that when we created x, the values didn’t automatically appear in the Console printout.
This is surprisingly helpful as we will often be creating very large objects and it’s better to view the previous few lines of code than thousands of lines of meaningless values. We can have a look at our vector in the Console by now just asking for x, or by using the command print(x). As you can imagine if x was a very large vector (e.g. 1:1000000) this is something which we’d need to be careful about.

Now look at the Environment (or Workspace) tab in the upper right window and you will see x listed there, along with a brief summary of what the vector contains. The abbreviation int is telling us that all the values in the vector are integer values. The numbers in the square brackets [1:10] are telling us the dimensions of the object, i.e. that it starts with position 1 and finishes at position 10. This is a very useful way of quickly checking what is contained in an object.

As well as displaying the entire vector in the Console as we did above, we could just have shown any one of the values by placing a number in square brackets after the x. For example, if we wanted just the second value or the first 3 values in the vector, we could type:

    # Just display the second value in the vector
    x[2]
    [1] 2

    # Now show the first three values
    x[1:3]
    [1] 1 2 3

An important behaviour to also observe is that if we’d tried to obtain a value outside the length of x, the request wouldn’t fail but would instead return the value NA.

    x[11]
    [1] NA

As you become more proficient using R to write long processes, slightly counter-intuitive behaviours like this mean you’ll need to constantly perform error checking to ensure you haven’t done anything silly by accident, and then passed nonsense down to any number of downstream processes.

2.2 Working with Vectors

This is one of the great strengths of R, and how it really became established as a key player in the world of Statistical Software. (And the fact that it’s free.)
Now that we have a convenient vector, we can just pass this to a wide variety of functions which are in the packages we have loaded.
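As a rough illustration of passing a whole vector to a function, the sketch below recreates x from section 2.1 and hands it to a few functions seen earlier in this document (sqrt, log, median) plus sum and mean. The choice of functions and the object names (total, average, middle, roots, logs) are our own for this example, and are not necessarily the ones the workshop goes on to use.

```r
# Recreate the vector from section 2.1
x <- 1:10

# Some functions collapse the whole vector down to a single value
total <- sum(x)
average <- mean(x)
middle <- median(x)

# Others are applied to every element and return a vector of the same length
roots <- sqrt(x)
logs <- log(x, base = 2)

print(total)
print(average)
print(middle)
print(length(roots))
print(logs[8])
```

Note that no loop is needed in either case: the function receives all ten values at once, which is what makes working with vectors so convenient.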

