Working with R: Tips and tricks for developers

 

Jon Ryser

Jon is an experienced, results-driven engineer who writes great code. He is confident and experienced at managing the unexpected.

Updated Sep 15, 2022

R is not a new language. It is an implementation of an older language called S that was initially developed between 1975 and 1976 by John Chambers. R was first conceived in 1992 and initially released in 1995. R has been taught to data analysis students in different fields for at least a decade. Because they learn about it in school, many data analysts who are not developers believe it is their only resource. 

R has a number of advantages for this kind of work. It’s focused on making statistical analysis and visualization easy for humans (resulting in performance tradeoffs, which we’ll discuss). It’s “fast enough” for a large swath of typical analytics tasks. The code is intuitive from the perspective of the data analyst. And, it’s highly flexible–you can change a lot of aspects of a program as you go, exploring data and iterating on a problem. 

Challenges of working in R

Although data scientists love it, as software engineers, we wouldn’t necessarily choose to work in R. There are mature statistical computing technologies that we can leverage to accomplish many of the same goals without the performance and DevOps disadvantages, such as Python or Julia. However, R is very common, and if you’re working on analytics projects–especially in finance, government, or academia–you’re likely to run into it eventually.

R was not designed as a general-purpose programming language. Unfortunately, it often gets used in contexts where it’s not optimal simply because it’s what people know. By the time we enter the picture, we’re stuck with it and have to figure out how to make it work as well as possible.

We recently engaged in a project where the client had an existing codebase written in R. Jumping the “R ship” didn’t make sense. We needed to work within the language and find the best possible solutions. Here, we share what we learned so you can avoid some challenges we had to overcome the hard way. 

Lack of documentation

The initial issue was documentation. While R has a large community and many packages that provide additional functionality, the documentation can be a bit spotty. It is challenging to find what a function returns or what arguments a specific function accepts. Since the language has been around for a long time, there are first-page Google results for outdated functions that do not indicate they are deprecated or what new function to use.

While there are many helpful blog posts on R, most of them are written by non-developers who are excited by the idea of a “function”. It’s great that these resources exist, but it can take a bit of digging to find deeply useful information.

Slow and manual package management

Package management in R leaves a lot to be desired. Packrat provides general package management functionality. Unfortunately, it tends to be slow as molasses. Each package is downloaded as source and then built. The source is stored locally as tar files. Whenever any one package is updated, all the packages are rebuilt–a very time-consuming process.

The Renv package is superior in many ways to Packrat and has gained significant popularity. Renv adds a lock file (renv.lock) that tracks and locks dependency versions, and it includes a script to bootstrap itself so it is available at the start of each R session. It is still a far cry from the seamless dependency management found in tools such as npm and yarn (for JavaScript).
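As a sketch, the day-to-day renv workflow boils down to three calls (these are real renv functions; the exact prompts and console output vary by version):

```r
# One-time setup in a project: creates renv.lock, a project-local
# package library, and the bootstrap script (renv/activate.R)
renv::init()

# After installing or upgrading packages, record the exact
# versions currently in use into renv.lock
renv::snapshot()

# On another machine (or in CI), reinstall the locked versions
renv::restore()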

Performance speed bumps

R is single-threaded. That’s right: with the vanilla distribution of R, you can run your app on a 64-core megaprocessor and it will poke along using only one of those cores. Additionally, R is an interpreted language, making processing of structures such as for-loops very slow. There are workarounds, but it’s still not going to match the performance of a compiled language. 
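One workaround ships with every R install: the parallel package can fan a computation out across cores, at least on Unix-like systems. A minimal sketch (slow_square here is our own stand-in for real work):

```r
library(parallel)  # part of base R, no extra install needed

slow_square <- function(x) {
  Sys.sleep(0.01)  # stand-in for a slow per-element computation
  x^2
}

# mclapply forks worker processes, spreading the work across cores.
# On Windows, fork() is unavailable: use mc.cores = 1 there, or
# parLapply with a PSOCK cluster instead.
results <- mclapply(1:8, slow_square, mc.cores = 2)
unlist(results)  # 1 4 9 16 25 36 49 64
```

This parallelizes across processes, not threads, so each worker pays a startup and data-copying cost; it only wins when the per-element work is heavy.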

In some ways, the performance limitations are a natural result of R’s key benefit, which is that it was written to help people do statistical analysis, not to optimize computing performance. A compiled C# program can do things very fast, but very few humans could look at the assembly code and understand what’s going on. R, by contrast, is highly legible and flexible. You can change functions, methods, fields, and objects whenever you want without breaking the application. That makes it easy to iterate and solve problems on the fly–as many humans like to do. 

Unusual Syntax

The syntax in R might feel a bit strange. For example, in many common programming languages, the properties of objects are referenced using a dot (.), as in

ThisObject.property01

In R, the dot (.) is (mostly) just a string character. It is typical, and often preferred, to create variable names using a dot (.) as a word separator instead of camel-case or snake-case, as in

this.great.variable

Instead, a dollar sign ($) is used to reference a property name, as in

ThisObject$property02

Another syntax quirk that may be unfamiliar to modern developers is the assignment operator ( <- , -> , = , <<- , ->> ). The preferred assignment operator in R is <- .
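Putting those pieces together in one base-R snippet (the variable and property names are invented for illustration):

```r
# Dots are ordinary characters in identifiers
this.great.variable <- 42

# $ extracts a named element from a list (or a column from a data frame)
ThisObject <- list(property01 = "hello", property02 = "world")
ThisObject$property02   # "world"

# All of these assign; <- is the conventional choice
x <- 1
2 -> y
z = 3
x + y + z               # 6
```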

Unfriendly Namespacing

Namespacing is challenging, global collisions are common, and argument passing is extremely “squishy”. The “squishy”-ness of function arguments can lead to code that is challenging to read (see http://adv-r.had.co.nz/Functions.html#function-arguments for more information).
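Part of that squishiness comes from how R matches arguments: exactly by name first, then by unambiguous partial name, then by position. A small illustration (the describe function is made up):

```r
describe <- function(value, verbose = FALSE) {
  if (verbose) paste("value is", value) else value
}

# All three calls are equivalent: exact name, partial name
# ("verb" is an unambiguous prefix of "verbose"), and position.
describe(10, verbose = TRUE)  # "value is 10"
describe(10, verb = TRUE)     # "value is 10"
describe(10, TRUE)            # "value is 10"
```

Partial matching means a reader cannot always tell which parameter a named argument binds to without opening the function's definition.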

Solutions that worked for us

Third-Party Package Management

There are many third-party packages available to include in an R application. These can provide additional functionality, such as database connections or data frame tools. Ensuring that these dependencies are consistent across environments and on each user’s machine helps the application behave predictably.

We leveraged Renv for package management, and built a process both for installing new dependencies and for getting each project’s existing dependencies installed into the app.

While Renv worked great for local development, it’s not supported in ShinyApps.io via RSconnect. The key was to maintain the renv.lock file for local development while ALSO maintaining project dependencies in the DESCRIPTION file under “imports”. We put a request in to RSconnect to support Renv, but have yet to receive a response.
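For reference, dependencies in a DESCRIPTION file go under the Imports field; the package name and values below are placeholders, not our actual project:

```
Package: ourapp
Title: An Example Shiny Application
Version: 0.1.0
Imports:
    dplyr,
    shiny
```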

A note of caution: we have seen folks recommending that you copy code from packages and paste it into your project to avoid the need to install the package as a dependency. However, this cuts your package code off from maintenance processes like bug fixes and updates. That’s a lot of copying and pasting, pretty much forever. I recommend that you always call the function from the package, even if it’s a little more work up front. 

Options for Improving Performance

To enhance performance in processing data frames, we dug a bit deeper than R. The solution often involves finding the right tool for the job, and knowing what NOT to do. 

  • Generally, I would avoid using “for” loops in R altogether. It’s hard to get them to run faster than “extremely slow.” When you need to iterate through a large data frame, it’s more efficient to convert it to a list and use purrr::map or purrr::pmap.
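As a sketch of that pattern (the data and column names are invented), purrr::pmap_dbl treats each column of a data frame as one of several parallel lists of arguments, visiting the rows without an explicit loop:

```r
library(purrr)

df <- data.frame(width = c(2, 3, 4), height = c(5, 6, 7))

# A for loop would index into the data frame element by element;
# pmap_dbl instead calls the function once per row, matching the
# column names to the function's argument names.
areas <- pmap_dbl(df, function(width, height) width * height)
areas  # 10 18 28
```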
  • Using dplyr::rowwise might be fine for a small data frame, but it doesn’t scale well for large data frames. Vectors to the rescue! Using the “ifelse” function (which is vectorized) can solve a lot of problems. For example, instead of writing something like 

na_to_green <- function(value) {
    if (!is.na(value)) {
        return(value)
    }
    return("green")
}

result <- This.data.frame %>%
    dplyr::rowwise() %>%
    dplyr::mutate(column.name = na_to_green(column.name))

it is more efficient to write something like

result <- This.data.frame %>%
    dplyr::mutate(column.name = ifelse(!is.na(column.name), column.name, "green"))

  • When doing joins, “slim down” the data by using dplyr::select or dplyr::distinct to limit the columns. You can also rename columns as part of the dplyr::select call.
  • When processing large data frames, limit the data as much as possible for the computation. If only a few specific columns are needed, do a dplyr::select and grab only those columns. More columns means more memory used.
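A sketch of slimming a lookup table before a join (all table and column names here are invented):

```r
library(dplyr)

orders <- data.frame(order.id = 1:3,
                     customer.id = c(10, 10, 20),
                     total = c(5, 7, 9))
customers <- data.frame(customer.id = c(10, 20),
                        customer.name = c("Ada", "Grace"),
                        notes = c("long text", "long text"))

# Keep (and rename) only the columns the join actually needs,
# and drop duplicate rows before joining
slim.customers <- customers %>%
    dplyr::select(customer.id, name = customer.name) %>%
    dplyr::distinct()

result <- orders %>%
    dplyr::left_join(slim.customers, by = "customer.id")
```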
  • R sometimes needs some help with memory management. Objects created in the global scope are added to the .GlobalEnv and stay there, filling memory, unless removed. If an object is needed in the global scope, remove it when it is no longer needed using

rm(the.global.object, pos = ".GlobalEnv")

 

And then call garbage collection:

gc()

Calling garbage collection after functions that create and / or process large pieces of data will help reduce memory usage.

  • Finally, you may be able to take advantage of enhanced R distributions such as Microsoft R Open, which supports multithreading, among other improvements.

R you excited yet?

R is pretty easy to pick up when you are using it for its intended purpose–exploring data and solving problems in an iterative, intuitive fashion. However, if you have to use it in the context of a modern application, it’s going to throw some barriers in your way. It’s just not built to meet those expectations. Don’t get too frustrated! Put on your 1996 developer hat and proceed from there.

We were able to overcome the challenges of working in R and are definitely stronger for it. R is very manageable and you can employ good engineering practices to make the most of it. If your product uses R and you’re stuck on what to do next, give us a shout. We’re happy to figure out how we can help.