You’ve painstakingly gathered data from 569 patients, 350 of whom were receiving an experimental treatment. Now you need to estimate the average effect of this treatment. It’s time for some data analysis!
Some may welcome this challenge, while others bristle at the thought of having to deal with complex calculations. Many researchers and students alike have found great help in this area through a free, open-source, and easily accessible software package simply called R. R continues to make itself a friend of both master data crunchers and those for whom data analysis is a scary task.
Odds are you have a colleague or professor who swears by it, and for good reason. Many experts and relative newcomers alike continue to switch to R from conventional statistical packages such as SPSS, SAS, and Stata, because of its flexibility and data visualization capabilities, not to mention the unbeatable price ($0). Let’s take a look at how R can add to your research capacities and make your life a bit more efficient.
What is R?
The R Project’s website says “R is a free software environment for statistical computing and graphics.” Yet it’s far more than a statistical package: R is in fact a programming language developed specifically for statistical analysis.
But don’t let that scare you. Professors Robert Gentleman and Ross Ihaka came up with the R Project within the Department of Statistics of the University of Auckland in New Zealand.
They introduced it in a 1996 article now cited over 4,500 times. Today, R is taught at universities all around the world, and its open-access nature has led to continued development. R continues to have a hugely devoted following, and the community of developers ensures its longevity. It also stacks up very well against the competition.
How to Get R and Get Started
To use R, you simply need to download it from the R Project website. Once it’s installed, you’ll be almost ready to do your analysis. Why almost? First, you may find it much more comfortable to use an integrated development environment (IDE). For example, RStudio, which is also free and open-source, has a host of features that make R easier to navigate and manipulate. Second, the standard R download gives access to only the base functionality of the tool.
But to enjoy the full power of R, you’ll need to download some additional packages. These packages include sets of certain helpful functions, developed by R programmers around the world, to efficiently solve specific problems. Among these are separate packages for processes such as data cleaning, creating sophisticated graphs, testing different theoretical models, building forecasts, and manipulating time series data. Armed with a general understanding of R, let’s see how this software can boost your data analysis for your research.
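As a rough sketch of what adding a package looks like, the snippet below checks whether a package is already installed and calls install.packages() only when it is missing. The helper name ensure_package is hypothetical (it is not part of R), and the install step assumes an internet connection and a configured CRAN mirror.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Downloads the package from CRAN; needs an internet connection.
    install.packages(pkg)
  }
  # TRUE if the package can now be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so nothing is downloaded here.
ensure_package("stats")

# For an add-on package you would call, for example:
# ensure_package("ggplot2")
# library(ggplot2)
```

Once a package is installed, library() makes its functions available in the current session.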
6 Reasons R Rocks for Scientific Research
1. Free and open-source
Everyone loves a bargain, and many value open sharing of technology. Therefore, the free and open-source nature of R is probably the foremost reason many researchers around the world select R. Anyone can pop the hood and examine the source code to see exactly what it’s doing.
This also means that you, or anyone else with the motivation and ability, can promptly fix bugs and make modifications as you wish. This can eliminate the need to wait for the vendor to find and fix the bug and put out an updated release.
2. Reproducible research
You’ve loaded your data on your experimental patients into SPSS, rearranged it as needed, inspected the summary statistics, deleted several cases with missing values, run the model, and observed some very strange results. You suspect there was a mistake in your data analysis process and want to discuss this issue with your colleague.
Do both of you now need to restart the analysis from the very beginning to discover where the error is? What if instead you had all your steps written in a short script that could be easily inspected for any errors? What if you or your colleague could repeat all the steps you’ve taken by simply pressing the Run button? Wouldn’t it make your life much easier?
Well, this is something R can do for you. You simply create scripts that include all steps of the analysis starting from loading data into R and finishing with preparing graphs and tables for reporting the results. Such a script allows easy reproducibility of your research. You can quickly try many different ideas, correct any issues that arise, and update your analysis if needed. And all this can be done simply by changing a few lines of code and clicking “Run”.
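To make this concrete, here is a minimal sketch of such a script for the patient example from the introduction. The data are simulated purely for illustration (the outcome values and the assumed effect of 5 are invented), and the difference in group means is a sensible estimate of the average effect only under the assumption that treatment was assigned at random.

```r
# Reproducible analysis sketch. The data are SIMULATED for illustration;
# a real script would load the study data instead, e.g. with read.csv().
set.seed(42)  # fixes the random numbers so every run gives the same result

n_total   <- 569
n_treated <- 350

patients <- data.frame(
  id      = seq_len(n_total),
  treated = rep(c(1, 0), times = c(n_treated, n_total - n_treated))
)
# Hypothetical outcome: baseline around 50, assumed treatment effect of 5.
patients$outcome <- 50 + 5 * patients$treated + rnorm(n_total, sd = 10)

# Step 1: inspect summary statistics.
print(summary(patients$outcome))

# Step 2: estimate the average effect as a difference in group means.
group_means <- tapply(patients$outcome, patients$treated, mean)
effect <- unname(group_means["1"] - group_means["0"])
print(effect)

# Step 3: the same estimate from a linear model, with a confidence interval.
fit <- lm(outcome ~ treated, data = patients)
print(coef(fit)["treated"])
print(confint(fit)["treated", ])
```

Because every step lives in the script, a colleague can rerun it from top to bottom and inspect each line for mistakes.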
3. Extremely easy data wrangling
R has several packages that hugely simplify the process of preparing your data for analysis. You may have your data stored in a .csv or .txt file, in Excel spreadsheets, in relational databases, or as a SAS or Stata file. R can usually load these various types of files with just one line of code, sometimes with the help of an add-on package. The process of data cleaning and transforming is also straightforward.
One line of code and you create a separate dataset without any missing values; another line and you impose multiple filters on your data. With such powerful capabilities, the time you spend on preparing your data for analysis can decrease significantly, giving you more time to spend on the analysis itself.
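The one-liners described above might look like the following sketch, which uses only functions from base R. The small .csv file and its column names (id, age, treated, outcome) are made up so that the example is self-contained.

```r
# Create a small example file so the sketch is self-contained.
# In practice the .csv file would already exist on disk.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,treated,outcome",
  "1,34,1,55.2",
  "2,61,0,48.9",
  "3,47,1,NA",
  "4,29,0,51.3",
  "5,72,1,60.1"
), csv_path)

# One line to load the data.
patients <- read.csv(csv_path)

# One line to drop every row that contains a missing value.
complete_cases <- na.omit(patients)

# One line to apply several filters at once: treated patients aged 40 or over.
treated_40_plus <- subset(complete_cases, treated == 1 & age >= 40)
print(treated_40_plus)
```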
4. Advanced visualizations
Even the basic functionality of R allows you to create histograms, scatterplots, or line plots with only a tiny bit of code. These are very convenient functions for visualizing your data before even starting any analysis. In a few seconds you can actually see your data and get insights that are not visible from the tabulated data alone.
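For instance, the three plot types mentioned above can each be drawn with a single call to a base R function. This sketch uses mtcars, an example dataset that ships with R; the titles and axis labels are just illustrative choices.

```r
# mtcars is an example dataset included with R.
# Histogram of fuel economy; hist() also returns the bin counts it computed.
h <- hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")

# Scatterplot of car weight against fuel economy.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Line plot of a simple sequence (type = "l" draws lines instead of points).
x <- 1:10
plot(x, x^2, type = "l", xlab = "x", ylab = "x squared")
```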
However, if you spend some time learning more advanced visualization packages, such as ggplot2, you’ll be able to build some very impressive, professional-looking graphs. R provides seemingly countless ways to visualize your data. And you’ll get access to a whole host of extra options, such as adding maps to your visualizations or making them animated.
5. Quick implementation of new theoretical approaches
At a higher level, R packages can be developed by anyone who learns the R programming language. This is the beauty of open-source. When a new theoretical framework appears, there’s no need to wait for the vendor to embed this framework into the software. The researcher or any R programmer can independently create the corresponding functions.
This process is in fact so fast, and the R community so active, that new research is often published together with an accompanying R package.
6. Easily extends to serve your specific needs
With R you’re not “locked in” to the defined options of the statistical package. As you get more comfortable with this programming language, you can write your own functions to satisfy your specific needs. You don’t need complex packages for this. Quite simple functions may be able to make your data analysis much more efficient.
For example, within your research activities you may need to load multiple files of a similar structure, and then perform the same manipulations on all of them (e.g., cleaning, filtering, selecting particular variables). So, instead of repeating the same actions for each of the files, you can write a function to do this for you. All you’ll need to do is provide the file name to the function and press “Enter”. The data from the file will be processed automatically as you programmed. If it’s not quite working right, you can fine-tune it.
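A function of this kind could be sketched as follows. The name process_file, the column names, and the age filter are hypothetical choices for illustration, and the two temporary files stand in for real data files.

```r
# Hypothetical helper: load one file, then clean, filter, and select columns.
process_file <- function(path) {
  raw <- read.csv(path)
  cleaned <- na.omit(raw)                    # cleaning: drop incomplete rows
  filtered <- cleaned[cleaned$age >= 18, ]   # filtering: adults only
  filtered[, c("id", "outcome")]             # selecting particular variables
}

# Create two small example files of the same structure (stand-ins for real data).
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
writeLines(c("id,age,outcome", "1,25,3.1", "2,16,2.7", "3,40,NA"), paths[1])
writeLines(c("id,age,outcome", "4,52,4.0", "5,33,3.6"), paths[2])

# Apply the same function to every file and stack the results.
results <- lapply(paths, process_file)
combined <- do.call(rbind, results)
print(combined)
```

If the cleaning rules change, you edit process_file once and rerun it on every file.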
3 Reasons Why Some Researchers Prefer Other Statistical Tools over R
1. Need to learn programming
Unlike some other popular statistical tools, R doesn’t have a point-and-click interface. Therefore, if you decide to use R for your data analysis, you’ll need to invest some time in learning the R programming language. Many professors do, however, walk their students through the necessary steps. It helps if you start thinking like a programmer, though that is not always essential.
The good news is that you need not be a black belt in R to do your particular research. Some basic knowledge of R and of the relevant packages should be enough. And this can be learned quite rapidly. You can also find plenty of tutorials to teach you how to use R in your specific research area. Reflecting the openness and enthusiasm of the R community, there are also many great videos out there.
2. Lower speed
R is often slower than many of the other popular statistical packages, especially when dealing with big datasets. The basic principles of R come from programming languages developed well back in the 20th century.
Owing to that, its design assumes that data have to be stored in physical memory, which can lead to issues when working with very large datasets. There are some ways to optimize code and make data analysis much faster, but generally speed is one of the well-known challenges R faces.
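One common way to optimize R code is to replace explicit loops with vectorized operations that work on a whole vector at once. The sketch below compares the two approaches on the same calculation; the function names are hypothetical, and actual timings depend on your machine, so none are shown.

```r
# Sum of squares for one million numbers, written two ways.
x <- as.numeric(1:1e6)

# Version 1: an explicit loop, processed element by element.
loop_sum_sq <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value^2
  }
  total
}

# Version 2: vectorized, applying built-in functions to the whole vector.
vector_sum_sq <- function(v) sum(v^2)

# Compare run times; exact timings depend on the machine.
print(system.time(loop_result <- loop_sum_sq(x)))
print(system.time(vector_result <- vector_sum_sq(x)))

# Both versions should agree up to floating-point rounding.
print(all.equal(loop_result, vector_result))

# object.size() reports how much memory the vector occupies.
print(object.size(x), units = "MB")
```

The vectorized version is usually the faster of the two, and object.size() gives a quick check on whether a dataset will fit comfortably in memory.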
3. No official support
All packages available from the R Project’s website are required to have sufficient documentation. However, with thousands of packages developed by thousands of programmers from all over the world, it’s hard to guarantee that all functions within all of the packages are well documented.
There’s also no official support for R users. Even though you can easily get help from the large community of R users, you can’t be sure that you’ll get an authoritative and qualified response to your inquiry within a reasonable time period. However, this is not a great risk, and searching and connecting with others can be part of the fun.
To tie this together, R has many features that can help you make your data analysis much more efficient and reliable. However, you’ll first need to spend a bit of time and effort on learning a new programming language. If you love learning, the sky’s the limit here, as there are seemingly infinite free resources available for the purpose.
Go ahead and geek out. Then, with this new skill, you’ll be able to clean and manipulate your data efficiently, test the newest theoretical models, make fully customized visualizations, enjoy reproducibility of your data analysis – and you sure can’t beat the price.