There’s a programming principle that says, “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.” It’s commonly called the “open/closed principle.”

I’ve been interested in the concept lately because I build and manage an analytics system written in R, and I’ve learned that it’s easy to paint yourself into a corner with a production system that people rely on. Eventually you find yourself asking, “Hey, wouldn’t it be nice if this function had this extra feature?” only to realize, “… wow, making this change would break a lot of downstream code.” You’re stuck. The function is not open for extension.

Like many data scientists, I’ve had the pleasure of learning tried-and-true programming principles on the job. I’m lucky to work with a great set of engineers, and as I read their Python code I keep seeing the function arguments *args and **kwargs everywhere. “What are these strange-looking arguments?”

It wasn’t until I attended a very useful talk by Brett Slatkin at PyCon 2015 that I fully understood *args and **kwargs are Python’s equivalent to R’s ... (spoken ‘dots’). Brett discussed the various ways that *args (positional arguments) and **kwargs (keyword arguments) could improve code readability and extensibility.

Before the talk, I was familiar with R’s ..., but only in that it meant I could pass an unlimited number of arguments to functions like mean(x, trim = 0, na.rm = FALSE, ...). The difference is subtle: dots are what let a function like sum accept both a vector, sum(c(1, 2)), and bare arguments, sum(1, 2). (Careful, though: mean(1, 2) does not average 1 and 2 — the 2 is silently matched to the trim argument.) The difference is so small that it’s very easy to miss.
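To see the mechanics for yourself, here’s a tiny sketch (count_args is a made-up name of mine, not a base R function):

```r
# Dots collect any number of extra arguments, much like Python's
# *args / **kwargs. list(...) captures them so we can inspect them.
count_args <- function(...) length(list(...))

count_args(1, 2, 3)     # 3 -- three separate arguments
count_args(c(1, 2, 3))  # 1 -- a single vector argument

# This is exactly the sum(1, 2) vs. sum(c(1, 2)) distinction:
sum(1, 2)     # 3 -- dots gather both numbers
sum(c(1, 2))  # 3 -- one vector argument
```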



Armed with my newfound knowledge of overloading, I set to work extending a design pattern I often use for working with event log data. Here’s a reproducible example that you can use to follow along and learn about ... as well.

# install.packages('dplyr') # the dplyr package is very useful for working with databases
library(dplyr)

To create an example, I’ll generate a bit of seed data. The first table is a simple event log. The second is a fan-out of properties related to the event log.

log <- tbl_df(
  data.frame(id = 1:10,
             user_id = rep(1, 10),
             name = c('visit')))

# id user_id  name
# 1       1 visit
# 2       1 visit
# 3       1 visit

properties <- tbl_df(
  rbind(
    expand.grid(id = 1:10,
                key = 'page',
                value = '/'),
    expand.grid(id = 1:10,
                key = 'referrer',
                value = 'google'))) %>%
  arrange(id, key)

# id     key  value
# 1     page      /
# 1 referrer google
# 2     page      /
# 2 referrer google
# 3     page      /
# 3 referrer google

Next I’ll create a SQLite database and INSERT the seed data to create an example that’s similar to how one might interface with an analytics database.

db <- src_sqlite('db', create = T)
copy_to(db, df = log)
copy_to(db, df = properties)

events <- function(named = NULL, ...) {
  if(is.null(named)) stop('need non-NULL arg: named')

  dots <- list(...) # It's handy to turn dots into a list
  # source: http://stackoverflow.com/a/5896544/1408640

  if(is.null(dots$key))
    return(
      tbl(db, 'log') %>%
        filter(name %like% named)
    )

  if(!is.null(dots$key))
    return(
      inner_join(events(named), # notice events() is called recursively here,
                                # I'm not yet sure if this is a brilliant or a very stupid thing to do.
                 log_properties() %>%
                   filter(key %like% dots$key),
                 by = 'id') # joining explicitly on id keeps the intent clear
    )
}

I find it very useful to create syntaxes in R. events(named = 'fun things') is easy to read and is very close to spoken English. In his book On Lisp, Paul Graham writes about building syntaxes towards a problem and then finally solving the problem. I like that philosophy and find it very useful for keeping code readable and composable.
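In that spirit, small wrappers compose naturally on top of events, forwarding extra keywords through dots (visits is a hypothetical helper of my own, not part of the system above):

```r
# A hypothetical convenience wrapper: it reads like spoken English and
# forwards any extra keyword arguments straight through via dots.
visits <- function(...) events('visit', ...)

# visits()                  # same as events('visit')
# visits(key = 'referrer')  # same as events('visit', key = 'referrer')
```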

log_properties <- function() {
  tbl(db, 'properties')
}
# Just a simple event
events('visit')

# Source: sqlite 3.8.6 [db]
# From: log [10 x 3]
# Filter: name %like% "visit"
#
# id user_id  name
# 1   1       1 visit
# 2   2       1 visit
# 3   3       1 visit
# 4   4       1 visit
# 5   5       1 visit

Now, overloading the events function with a keyword to filter on keys feels natural. Though I haven’t added it here, filtering on value is another useful extension.

# Filtering visit referrers
events('visit', key = 'referrer')

# Source: sqlite 3.8.6 [db]
# From: <derived table> [?? x 5]
#
# id user_id  name      key  value
# 1   1       1 visit referrer google
# 2   2       1 visit referrer google
# 3   3       1 visit referrer google
# 4   4       1 visit referrer google
# 5   5       1 visit referrer google

# Filtering page visits
events('visit', key = 'page')

# Source: sqlite 3.8.6 [db]
# From: <derived table> [?? x 5]
#
# id user_id  name  key value
# 1   1       1 visit page     /
# 2   2       1 visit page     /
# 3   3       1 visit page     /

Adding functionality is also much easier: I can always introduce a new keyword. I could add a new section to the body of the function that checks for !is.null(some_new_keyword), without disturbing downstream calls. For example,

events <- function(named = NULL, ...) {
  if(is.null(named)) stop('need non-NULL arg: named')

  dots <- list(...) # It's handy to turn dots into a list
  # source: http://stackoverflow.com/a/5896544/1408640

  if(!is.null(dots$some_new_keyword)) {
    # ... handle the new keyword here, without
    # disturbing any of the existing behavior
  }
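To make that concrete, here’s one hedged sketch of how a value keyword (the extension mentioned earlier) could slot in. The value filter is my own illustration, not code from the production system, and it assumes db, log_properties(), and library(dplyr) from earlier in the post:

```r
# Old calls keep working untouched; each new keyword simply
# composes another filter on top of the existing result.
events <- function(named = NULL, ...) {
  if(is.null(named)) stop('need non-NULL arg: named')

  dots <- list(...)

  result <- tbl(db, 'log') %>%
    filter(name %like% named)

  if(!is.null(dots$key))
    result <- inner_join(result,
                         log_properties() %>%
                           filter(key %like% dots$key),
                         by = 'id')

  if(!is.null(dots$value))  # the new keyword: nothing above had to change
    result <- result %>%
      filter(value %like% dots$value)

  result
}

# events('visit')                                     # still works
# events('visit', key = 'referrer')                   # still works
# events('visit', key = 'referrer', value = 'google') # new behavior
```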

I’m still a beginner when it comes to coding in an “open/closed” way, so if you have suggestions for ways that I can improve the code I shared here please get in touch on Twitter! @statwonk.

