There’s a programming phrase that says, “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.” The phrase is commonly called the “open/closed principle.”
I’ve been interested in the concept lately because I build and manage an analytics system written in R and I’ve learned that it’s easy to paint yourself into a corner with a production system on which people rely. Eventually you find yourself asking, “Hey, wouldn’t it be nice if this function had this extra feature?” But then you realize, “… wow, making this change would break a lot of downstream code.” You’re stuck. The function is not open for extension.
Like many data scientists, I’ve had the pleasure of learning tried and true programming principles on the job. I’m lucky to work with a great set of engineers and as I read their Python code I keep seeing the function arguments *args and **kwargs everywhere. “What are these strange looking arguments?”
It wasn’t until I attended a very useful talk by Brett Slatkin at PyCon 2015 that I fully understood that *args and **kwargs are Python’s equivalent to R’s ... (spoken ‘dots’). Brett discussed the various ways that *args and **kwargs (variable positional and keyword arguments) could improve code readability and extensibility.
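To make the comparison concrete, here is a minimal sketch of how the dots can play both roles in R. The function name show_dots is hypothetical, chosen only for illustration: unnamed arguments behave roughly like Python's *args and named ones like **kwargs.

```r
# show_dots() is a hypothetical helper, not part of any package.
show_dots <- function(...) {
  dots <- list(...)                        # capture everything passed through ...
  nms <- names(dots)
  if (is.null(nms)) nms <- rep("", length(dots))
  list(
    positional = unname(dots[nms == ""]),  # roughly Python's *args
    keyword    = dots[nms != ""]           # roughly Python's **kwargs
  )
}

res <- show_dots(1, "a", key = "page", value = "/")
str(res)
```

Calling str(res) should show two unnamed positional values and a named list holding key and value.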
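Before the talk, I was familiar with R’s ..., but only in that it meant I could pass extra arguments to functions like mean(x, trim = 0, na.rm = FALSE, ...). The difference is subtle, but “dots” lets a function accept arguments that its signature never names. For example, sum(..., na.rm = FALSE) accepts both a vector, sum(c(1, 2)), and separate values, sum(1, 2). Note that mean does not work this way: in mean(1, 2) the 2 is matched to trim rather than averaged. The difference is so small that it’s very easy to miss.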
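A quick way to see what the dots do (and don't do) is to compare a few calls in base R. This is only a sketch; the comments describe how arguments are matched for sum and mean.

```r
# sum() is defined as sum(..., na.rm = FALSE): values arrive through the dots
sum(c(1, 2))   # 3
sum(1, 2)      # 3

# mean() is defined as mean(x, ...): only x is averaged
mean(c(1, 2))  # 1.5
mean(1, 2)     # 1, because 2 is matched to trim and is not treated as data
```

So the dots in a signature say nothing by themselves about whether extra values are treated as data; that depends on what the function body does with them.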
Armed with my newfound knowledge of overloading, I set to work extending a design pattern I often use for working with event log data. Here’s a reproducible example that you can use to follow along and learn about ... as well.
# install.packages('dplyr') # the dplyr package is very useful for working with databases
library(dplyr)
To create an example, I’ll generate a bit of seed data. The first instance is a simple event log. The second is a fanout of properties related to the event log.
log <- tbl_df(
  data.frame(id = 1:10,
             user_id = rep(1, 10),
             name = c('visit')))
# id user_id name
# 1 1 visit
# 2 1 visit
# 3 1 visit
properties <- tbl_df(
  rbind(
    expand.grid(id = 1:10,
                key = 'page',
                value = '/'),
    expand.grid(id = 1:10,
                key = 'referrer',
                value = 'google'))) %>%
  arrange(id, key)
# id key value
# 1 page /
# 1 referrer google
# 2 page /
# 2 referrer google
# 3 page /
# 3 referrer google
Next I’ll create a SQLite database and INSERT the seed data to create an example that’s similar to how one might interface with an analytics database.
db <- src_sqlite('db', create = TRUE)
copy_to(db, df = log)
copy_to(db, df = properties)
events <- function(named = NULL, ...) {
  if(is.null(named)) stop('need non-NULL arg: named')
  dots <- list(...) # It's handy to turn dots into a list
  # source: http://stackoverflow.com/a/5896544/1408640
  if(is.null(dots$key))
    return(
      tbl(db, 'log') %>%
        filter(name %like% named)
    )
  if(!is.null(dots$key))
    return(
      inner_join(events(named), # notice events() is called recursively here,
                 # I'm not yet sure if this is a brilliant or a very stupid thing to do.
                 log_properties() %>%
                   filter(key %like% dots$key))
    )
}
I find it very useful to create syntaxes in R. events(named = 'fun things') is easy to read and is very close to spoken English. In his book On Lisp, Paul Graham writes about building syntaxes towards a problem and then finally solving the problem. I like that philosophy and find it very useful for keeping code readable and composable.
log_properties <- function() {
  tbl(db, 'properties')
}
# Just a simple event
events('visit')
# Source: sqlite 3.8.6 [db]
# From: log [10 x 3]
# Filter: name %like% "visit"
#
# id user_id name
# 1 1 1 visit
# 2 2 1 visit
# 3 3 1 visit
# 4 4 1 visit
# 5 5 1 visit
Now, overloading the events function with a keyword to filter on keys is natural. Though I haven’t added it here, filtering on value is another useful extension.
# Filtering visit referrers
events('visit', key = 'referrer')
# Source: sqlite 3.8.6 [db]
# From: <derived table> [?? x 5]
#
# id user_id name key value
# 1 1 1 visit referrer google
# 2 2 1 visit referrer google
# 3 3 1 visit referrer google
# 4 4 1 visit referrer google
# 5 5 1 visit referrer google
# Filtering page visits
events('visit', key = 'page')
# Source: sqlite 3.8.6 [db]
# From: <derived table> [?? x 5]
#
# id user_id name key value
# 1 1 1 visit page /
# 2 2 1 visit page /
# 3 3 1 visit page /
Adding functionality is also much easier, as I could always start using a new keyword. I could add a new section to the body of the function and check for !is.null(dots$some_new_keyword) without disturbing downstream calls. For example,
events <- function(named = NULL, ...) {
  if(is.null(named)) stop('need non-NULL arg: named')
  dots <- list(...) # It's handy to turn dots into a list
  # source: http://stackoverflow.com/a/5896544/1408640
  if(!is.null(dots$some_new_feature)) {
    # do something ...
  }
  # ... the rest of the function body stays as before
}
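As one concrete instance of this pattern, here is a rough sketch of the value filter mentioned earlier. To keep it runnable without a database, it uses small in-memory data frames and exact matching with == instead of %like%. The names log_df, props_df and events_local are stand-ins invented for this sketch, not part of the code above.

```r
# In-memory stand-ins for the log and properties tables (assumed for this sketch)
log_df <- data.frame(id = 1:3, user_id = 1, name = "visit",
                     stringsAsFactors = FALSE)
props_df <- data.frame(id = rep(1:3, each = 2),
                       key = rep(c("page", "referrer"), times = 3),
                       value = rep(c("/", "google"), times = 3),
                       stringsAsFactors = FALSE)

events_local <- function(named = NULL, ...) {
  if (is.null(named)) stop("need non-NULL arg: named")
  dots <- list(...)
  out <- log_df[log_df$name == named, , drop = FALSE]
  # Existing behaviour: no keywords means a plain event lookup
  if (is.null(dots$key) && is.null(dots$value)) return(out)

  props <- props_df
  if (!is.null(dots$key)) {
    props <- props[props$key == dots$key, , drop = FALSE]
  }
  # New feature: callers that never pass value are unaffected
  if (!is.null(dots$value)) {
    props <- props[props$value == dots$value, , drop = FALSE]
  }
  merge(out, props, by = "id")
}

events_local("visit")                      # old call, unchanged result
events_local("visit", key = "referrer")    # old call, unchanged result
events_local("visit", value = "google")    # new keyword
```

The signature never changes, so existing calls keep working while the new keyword is opt-in.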
I’m still a beginner when it comes to coding in an “open/closed” way, so if you have suggestions for ways that I can improve the code I shared here please get in touch on Twitter! @statwonk.