fread to read delimited files
rbindlist to combine tables
fwrite to write delimited files
This chapter covers R’s core data structures and syntax.
In R, all data come in vectors. If we enter x = 54.7
, then x
will be a numeric vector of length 1. Similarly, y = c("Yah", "Bada bing")
is a character vector of length 2. Type a vector’s name to see it, and use length
and class
to see those attributes.
# example data
x = c(TRUE, FALSE, TRUE)
x
# [1] TRUE FALSE TRUE
length(x)
# [1] 3
class(x)
# [1] "logical"
The bracketed [1]
indicates that we are seeing the first element of a vector. This printing pattern is handy with vectors too long for a single line, like c(LETTERS, letters)
. It is also a useful reminder that vectors are indexed starting from 1 (instead of 0 as in C, Python, et al).
Assign with =
or <-
. Or even like c(2,3) -> z
, though this is rare and hard to read in my opinion. If an existing vector is overwritten, no warning is given.
Try typing help.start()
. This will open a page of links to manuals including “An Introduction to R”, “The R Language Definition” and the “R FAQ” (under “Resources”). These are also available in pdf online.
Everything is documented in convenient wikified html pages, accessible with ?
or help
. For example, ?TRUE
, ?LETTERS
or ?length
. Backticks are sometimes required, as for ?`for`
and ?`?`
. Keep a close eye on the Description and Value sections, which concisely explain what to expect from a function. The Examples section always usefully illustrates functionality; it can be run in the console like example("LETTERS")
. Additional illustrations are available using demo
; type demo()
for a listing.
R doesn’t do error codes, only error messages and warnings. If confused by a message, a search online or a review of the relevant docs is usually sufficient.
Every vanilla vector has a single class. That is, every element of the vector contains the same type of data:
c(4, c("A", "B"))
# [1] "4" "A" "B"
4
is “coerced” to character
above, with no warning. Since coercion is so central to R and confusing for new users, I strongly suggest reading its documentation, in the Details section of ?c
.
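As a small sketch of coercion in action, the lines below combine values of different classes and check the class of each result. The variable names are illustrative only.

```r
# Combining values of different classes coerces everything to the most
# flexible class present: logical < integer < numeric < character.
lgl_int = c(TRUE, 2L)   # logical joins integer   -> integer
int_num = c(1L, 2.5)    # integer joins numeric   -> numeric
num_chr = c(1.5, "a")   # numeric joins character -> character

class(lgl_int)
class(int_num)
class(num_chr)
```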
The doc at ?c
also conveniently lists all of the vanilla (“atomic”) classes. The key ones are
logical: TRUE or FALSE.
integer: written with an L at the end: 1L, 2L, etc. The L will not be displayed in output.
numeric: written with or without e notation: 2, 3.4, 3e5.
character: written in quotes: "bah", 'I say "gah"', etc.
A list is a special type of vector whose elements can be arbitrary objects, for example, a list of vectors:
# example data
L = list("A", c(1,2))
L2 = list(FALSE)
L3 = c(L, L2)
L3
# [[1]]
# [1] "A"
#
# [[2]]
# [1] 1 2
#
# [[3]]
# [1] FALSE
The [[1]]
printed here indicates the first element of the list, similar to the [1]
we see for atomic vectors.
?typeof
and the R internals documentation. While some fancy data structures like linked lists and unordered sets are absent in base R, they can be used through packages like Rcpp.
For lists, in addition to a length
, we also have lengths
, measuring each element:
lengths(L3)
# [1] 1 2 1
The class of an object can be tested with is.logical
, etc.
To compare two objects, use ==
and !=
. R will silently coerce the objects’ classes to match:
"4.11" == 4.11
# [1] TRUE
In addition, there are the usual inequality operators (>
, <
, >=
, <=
), which also apply to strings, using lexicographic ordering:
"A" >= "B"
# [1] FALSE
For complicated objects, like L3
above, class(L3)
is not very informative. Examining the “structure” of the object is usually more useful:
str(L3)
# List of 3
# $ : chr "A"
# $ : num [1:2] 1 2
# $ : logi FALSE
Sometimes, we need closer inspection for debugging. The functions unclass
(which removes the class) and dput
(which prints R code to create the object) are helpful for this. See 3.6.4.5 for an example relevant to date and time objects.
To explore the set of loaded objects, see the tools in 3.7.1.
A vector’s elements can be named:
c(a = 1, b = 2)
# a b
# 1 2
list(A = c(1, 2), B = 4)
# $A
# [1] 1 2
#
# $B
# [1] 4
The names of an object’s elements can be accessed with the names function.
c(A = 1)
, the equals sign is not interchangeable with the <-
, so don’t write c(A <- 1)
. Think of c
as a function with syntax like c(argname = argvalue, argname2 = argvalue2, ...)
. The equals sign has the special role of giving names to the function’s arguments. See 2.4.2 for details.
To assign names programmatically…
nms = c("a", "b")
c(nms[1] = 1, nms[2] = 2) # error
# Error: <text>:2:10: unexpected '='
# 1: nms = c("a", "b")
# 2: c(nms[1] =
# ^
setNames(c(1, 2), nms) # use this instead
# a b
# 1 2
R differs from other programming languages in its careful treatment of missing values in the reading and processing of data. The missing-data code NA
is distinct from other special codes: NaN; Inf and -Inf; and NULL.
A missing value means “there is a value here, but we don’t know what it is.” So a comparison against a missing value, like 2 == NA
or 2 != NA
, will always return NA
, since we cannot determine whether the condition is true or false. To test whether a value is missing, use the is.na
function.
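A brief sketch of the difference between comparing against NA and testing for it; the vector x and the derived names are made up for illustration.

```r
x = c(1, NA, 3)

x == NA                  # every comparison is itself missing
is.na(x)                 # the correct test for missingness

n_missing = sum(is.na(x))   # count of missing elements
x_known = x[!is.na(x)]      # keep only the non-missing elements
```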
NA
is different depending on the vector it is in. Sometimes, it is important to ensure that a missing value belongs to a particular class. To ensure this, the safest approach is to “slice” an object of that class with NA_integer_
. For example, to get a missing numeric value, we can use the number 10
like 10[NA_integer_]
; or for a date value, we can use the current date, Sys.Date()
, like Sys.Date()[NA_integer_]
.
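To illustrate, here is a minimal check that slicing with NA_integer_ yields a missing value of the same class as the object sliced. The names na_num and na_date are hypothetical.

```r
class(NA)                            # the bare NA is logical

na_num = 10[NA_integer_]             # missing numeric value
na_date = Sys.Date()[NA_integer_]    # missing date value

class(na_num)
class(na_date)
is.na(na_num)
is.na(na_date)
```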
Select a subvector of x
like x[i]
:
# example data
x = c("a", "b", "c", "d")
L = list(X = 1, Y = 2, Z = 3)
x[c(1, 3)]
# [1] "a" "c"
x[-c(2, 3)]
# [1] "a" "d"
L[c("Y","Z")]
# $Y
# [1] 2
#
# $Z
# [1] 3
x[c(FALSE, TRUE, TRUE, FALSE)]
# [1] "b" "c"
So, the index i
can be…
i
can even contain repeated elements, like x[c(1,1,2)]
, which is some more general “slice” of x
, not really a subvector.
x[TRUE]
in the example above? Well, in R, it acts like x[c(TRUE, TRUE, TRUE, TRUE)]
, without giving any warning. This is called recycling – the argument i
is repeated until it is long enough for the context it is used in. What about x[c(TRUE, FALSE)]
? The full rules for recycling can be found in the R intro doc.
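A short sketch of recycling in a logical index, using the same x as in the example data above; the result names are illustrative.

```r
x = c("a", "b", "c", "d")

# c(TRUE, FALSE) is recycled to c(TRUE, FALSE, TRUE, FALSE)
every_other = x[c(TRUE, FALSE)]

# a single TRUE is recycled to the full length of x
all_of_x = x[TRUE]
```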
i
as NA_integer_
to find the “missing value version of x
.” Similarly, we can set i
to 0L
to find the “empty version of x
.” Setting i
to include positions or names outside of range will just yield missing values and is typically not useful.
Convenience functions head
and tail
enable easy slicing from the start or end of a vector.
Grab a single element of a vector like x[[i]]
, where i
is a single name or number:
# example data
L = list(X = 1, Y = 2, Z = 3)
L["Z"] # slicing
# $Z
# [1] 3
L[["Z"]] # extracting
# [1] 3
The slice L["Z"]
is a list containing just one element; while L[["Z"]]
actually is the element.
L$Z
offers a handy alternative way of extracting by name from a list, but it is tougher to write a program around.
Assign to a subvector, or “subassign,” like x[i] = y
or x[i] <- y
:
# example data
x = c("a", "b", "c", "d")
L = list(X = 1, Y = 2, Z = 3)
x[c(2,4)] <- c("bada", "bing")
L[c("Y", "Z")] = list(21, 31)
The allowed values of i
are the same as we saw for slicing a vector in 2.1.7. This is fairly unsafe on account of silent coercion, recycling and failure (when i
is not a valid index of x
). We will revisit how to modify vectors in data.tables in 3.4.
?`[`
. Even though there is a ]
for every [
, it is the opening bracket that is the function’s actual name, which will come back in 2.4.3. For documentation on sub-assignment, type ?`[<-`
.
Assigning to an empty index, x[] <- "blargh"
, will overwrite all elements. Assigning to indices beyond the vector’s bounds, x[11] <- "huzzah"
, will silently extend the vector to accommodate the new element, filling any gap with missing values (which is generally quite inefficient).
One can similarly assign to a single element of a list with x[[i]] <- y
. If an existing element is overwritten, no warning is given.
In R, it is bad practice to dynamically grow objects, like x = 1
, x = c(x,2)
, …, x = c(x,n)
. Instead, create the vector with its final length first:
x = vector(length = 5, mode = "list")
x = character(length = 5)
x = logical(length = 5)
x = numeric(length = 5)
x = integer(length = 5)
From there, we can subassign iteratively if necessary: x[1] <- "A"
, x[2] <- "B"
, and so on.
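The pattern might look like the sketch below, which preallocates a numeric vector and then fills it by subassignment. The names n and squares are illustrative, and a vectorized expression would be preferable for a task this simple.

```r
n = 5
squares = numeric(length = n)   # preallocate at the final length

for (i in seq_len(n)) {
  squares[i] = i^2              # subassign; the vector never grows
}

squares
```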
It’s also common to initialize a vector with a particular value using rep
:
rep(3, 10)
# [1] 3 3 3 3 3 3 3 3 3 3
rep(x, n)
extends nicely to the case where x
is a vector. It does this in a few different configurations, as documented at ?rep
:
rep(c(3, 4), 5)
# [1] 3 4 3 4 3 4 3 4 3 4
rep(c(3, 4), each = 5)
# [1] 3 3 3 3 3 4 4 4 4 4
rep(c(3, 4), length.out = 5)
# [1] 3 4 3 4 3
rep(c(3, 4), c(3, 4))
# [1] 3 3 3 4 4 4 4
Use a single colon to build a sequence of integers:
3:5
# [1] 3 4 5
Use parentheses liberally.
n:n+10 does not run from n to n+10, because the colon binds more tightly than +. Parentheses are needed: n:(n+10).
-1:3+10 and 10-1:3 do not give the same results. Again, parentheses are needed to clarify which result is wanted.
If we are running 1..n
where n
is a nonnegative integer variable, it is safer and more efficient to use seq_len
than 1:n
. It is safer in the sense that, if n
is zero, seq_len(n)
gives the correct result of a zero-length vector, while 1:0
gives c(1L, 0L)
.
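The difference matters most in loops. In this sketch (with made-up counter names), the loop over 1:n runs twice when n is zero, while the loop over seq_len(n) correctly does not run at all.

```r
n = 0

count_colon = 0
for (i in 1:n) count_colon = count_colon + 1          # iterates over c(1L, 0L)

count_seq_len = 0
for (i in seq_len(n)) count_seq_len = count_seq_len + 1   # iterates over nothing

count_colon
count_seq_len
```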
If we are running 1...length(x)
alongside some vector x
, then seq_along
is the right tool:
x = c(3, 3, 4)
seq_along(x)
# [1] 1 2 3
seq.int
, seq_along
, setNames
, Sys.time
, Sys.Date
, sys.status
and system.time
. All names are case-sensitive.
From here, the variety of options starts to resemble what we saw for the rep
function in 2.1.11.1.
The function seq.int
extends the colon operator:
seq.int(5, 10)
# [1] 5 6 7 8 9 10
seq.int(5, 10, by = 2)
# [1] 5 7 9
seq.int(5, by = 3, length.out = 3)
# [1] 5 8 11
seq.int(to = 100, by = -11, length.out = 3)
# [1] 122 111 100
seq.int(5, 10, by=.5)
# [1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
The seq
function does essentially the same thing; see the docs if interested in details. Other vector classes have their own seq
methods. For example, seq.Date
will allow for a sequence of dates.
As with missing values, a lot of thought was put into the treatment of categorical data in R, captured by the factor
data type:
factor(c("New York", "Tokyo", "Mumbai", "Tokyo"))
# [1] New York Tokyo Mumbai Tokyo
# Levels: Mumbai New York Tokyo
Factors often have no sense of ordering (like “Tokyo” naturally coming before “Mumbai”), but one can be added if appropriate:
factor(c("Worst", "Best", "Not Bad", "Worst"), levels = c("Worst", "Not Bad", "Best"), ordered = TRUE)
# [1] Worst Best Not Bad Worst
# Levels: Worst < Not Bad < Best
To see whether a factor has an order, use is.ordered
. Details on manipulating factors can be found at ?factor
and linked pages.
There are three key functions here:
sort(x) will sort the vector.
rank(x) tells where each element of x is in the pecking order.
order(x) is rarely useful on its own, but y[order(x, z)] will sort y by x and z.
Be careful not to use these functions on an unordered factor, where they will make no sense but run without warning.
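A small sketch comparing the three functions on made-up vectors x and y:

```r
x = c(30, 10, 20)
y = c("c", "a", "b")

sort(x)       # the values in increasing order
rank(x)       # position of each element in the sorted order
order(x)      # indices that would sort x

y_sorted = y[order(x)]   # reorder y according to x
y_sorted
```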
The frank()
function from the data.table package is also useful, for its “dense rank” tie-breaking rule.
c(TRUE, 1)?
list(1, c(2,3,4))?
x = c("a","b") and then x[c(2,1)] = c("c","d"), what is the value of x[1]?
L$X <- list(3)?
The matrix function is used for construction:
# example data
v = c(1,2,3,4)
matrix(v, nrow = 2, ncol = 2)
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
matrix(v, nrow = 2, ncol = 2, byrow = TRUE)
# [,1] [,2]
# [1,] 1 2
# [2,] 3 4
By default, the matrix is built with column-major order; but byrow = TRUE
will read as row-major.
The dimensions of a matrix can be extracted with the dim
function or nrow
and ncol
.
All elements of a matrix must have the same class, just like a vector. So we can have a “character matrix”, a “numeric matrix”, etc.
dim
attribute. As mentioned at the top, all data in R is stored in vectors; and as mentioned in 2.1.2, most classes are built on top of vectors. One wrinkle is that is.vector
is false for matrices. However, this makes sense in light of the doc, ?is.vector
.
Matrices can also be built from vectors with cbind
and rbind
(for “binding” columns or rows):
rbind(c(1,1), c(2,2))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
cbind(c(1,1), c(2,2), c(3,3))
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 1 2 3
These functions can also build matrices from other matrices.
diag(n)
will make an identity matrix of size n
; and diag(v)
will make a diagonal matrix with v
on the diagonal.
Rows and columns can be named:
matrix(c(1,2,3,4), nrow = 2, dimnames = list(c("a","b"), c("x","y")))
# x y
# a 1 3
# b 2 4
To select a submatrix (a “matrix slice”), use m[i,j]
as in math:
# example data
m = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))
# x y z
# a 1 3 5
# b 2 4 6
m[1, c(1,2), drop = FALSE]
# x y
# a 1 3
m["a", c("x","y"), drop = FALSE]
# x y
# a 1 3
We can get fancy with indices, in all the ways described in 2.1.7. It is often convenient to leave one index blank, which means “select all”; try m[, 2, drop = FALSE]
.
drop = FALSE
option is necessary to ensure that the result is a submatrix. R’s default behavior is to “drop” dimensions when it can, which makes for unpredictable output.
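A quick sketch of what dropping looks like in practice, using a small illustrative matrix: the same row selection is a plain vector by default and a one-row matrix with drop = FALSE.

```r
m = matrix(1:6, nrow = 2)

row_vec = m[1, ]                  # dimensions dropped: a plain vector
row_mat = m[1, , drop = FALSE]    # dimensions kept: a 1 x 3 matrix

dim(row_vec)
dim(row_mat)
```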
Implicitly, matrices are regarded as having observations as rows and variables as columns. As a result, head
and tail
will return rows from the top and bottom of a matrix, respectively.
Since matrices are vectors, we can extract a vector of values in the same way:
# example data
m = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))
# x y z
# a 1 3 5
# b 2 4 6
m[3]
# [1] 3
m[c(2,3)]
# [1] 2 3
Matrices also allow extraction like X[Y]
, where Y
is a two column “index matrix”:
im = matrix(c(1,1,2,2), 2, 2, byrow = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
m[im]
# [1] 1 4
The first column of im
corresponds to rows, and the second to columns. Elements are extracted in the order they are listed in im
.
drop = TRUE
can be used to select a column or row, like m[, 2]
(where drop = TRUE
is the default) or a single element, like m[1, 2]
or m["a", "y"]
.
Finally, diag(m) will extract the diagonal from a matrix; while upper.tri and lower.tri give logical masks for those parts, so m[upper.tri(m)] and m[lower.tri(m)] extract them (in column-major order).
Assign to elements of a matrix in one of a few ways:
X[i,j] = z where i and j are rows and columns, respectively
X[i] = z with i being an index vector
X[im] = z with im being an index matrix
In all cases, z is a vector or matrix of values.
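The three forms of assignment can be sketched as follows, on an illustrative 2x2 matrix of zeros:

```r
X = matrix(0, nrow = 2, ncol = 2)

X[1, 2] = 5      # by row and column
X[2] = 7         # by vector index: second element in column-major order,
                 # which is row 2, column 1

im = matrix(c(1, 1, 2, 2), 2, 2, byrow = TRUE)
X[im] = c(-1, -2)   # by index matrix: cells (1,1) and (2,2)

X
```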
Arrays are the same as matrices, just in higher dimensions. For example, a three-dimensional array will have a dim
attribute of length three and have slicing syntax like a[i,j,k]
. They can be constructed like a = array(v, dim = c(1,2,2))
.
Matrix algebra functions are covered in 2.3.5. Some other handy functions are:
colSums and rowSums, which take sums over columns and rows, respectively.
max.col, which finds, for each row, a column achieving the per-row maximum. Ties are broken at random by default; pass ties.method = "first" to get the first such column.
The apply function is popular for applying arbitrary functions along dimensions of a matrix or array. However, it is very inefficient and usually discouraged. sweep is a similar function.
x = 1:6, build a 3x2 matrix rowwise like matrix(x, _fill_this_in_).
m = matrix(1:6, 2), extract a submatrix containing the first column.
?do.call and from L = list(c(1,1), c(2,2), c(3,3)), bind the elements of L as columns of a single matrix.
m = matrix(1:4, 2), overwrite the upper left and lower right cells with 0.
The parser interprets ; as the end of a command, so three commands x <- 10; y <- 5; x + y can be written on a single line. This is rarely useful when writing a program.
The {...}
braces will return their last value, so {x <- 10; y <- 5; x + y}
will return 15
. It will also have the side effect of creating x
and y
. To avoid this side effect, use local
, like
local({
x <- 10
y <- 5
x + y
})
# [1] 15
This is also rarely needed since DT[i, {...}, by]
syntax for data.tables (3) achieves the same result.
There are convenience functions for modifying object attributes, with syntax like names(x) <- y
or names(x) = y
. A few are for…
?`names<-`
?`class<-`
?`length<-`
?`dim<-`
?`dimnames<-`
?`diag<-`
, ?`upper.tri<-`
, ?`lower.tri<-`
These are rarely needed and can be confusing for obvious reasons. Section 3.4 will discuss a different approach.
The full set of arithmetic operators is documented at ?Arithmetic
. The integer division and modulo operators look like %/%
and %%
, but otherwise, everything is standard. Other basic functions include round
, floor
, ceiling
, min
, max
, log
, exp
, sqrt
.
Essentially all computations involving missing values will return a missing value. One exception is NA^0
since no matter what number the true value is, this expression will evaluate to 1
. (Recall that a missing value means “there is some true value, but we don’t know what it is.”) The min
and max
functions can ignore missing values, as noted in their docs. 3.7.4 and 3.7.5 cover other summarizing functions, most of which similarly offer to ignore NA
s.
As in any other programming environment, floating point arithmetic in R can trip up calculations. For example, try .1 + .05 - .15 == 0
.
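One common workaround, sketched here with illustrative names a and b, is to compare within a tolerance rather than exactly. all.equal is the base R tool for this; wrapping it in isTRUE gives a plain TRUE or FALSE.

```r
a = 0.1 + 0.05
b = 0.15

a == b                      # exact comparison is tripped up by rounding error
isTRUE(all.equal(a, b))     # comparison within a small tolerance
abs(a - b) < 1e-9           # a hand-rolled tolerance check
```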
is_gt = c(1,5,2) > c(1,2,3); sum(is_gt)
.
The self-explanatory symbols !
, &
and |
are used to construct compound logical tests. The treatment of missing values is intuitive:
NA | TRUE is true, since no matter what the missing value is (TRUE or FALSE), the compound statement must be true.
NA & TRUE is NA, since we need to know the missing value to determine the truth or falsehood of the compound statement.
Numbers used in logical tests are treated as FALSE when zero and TRUE otherwise. This allows us to use length(x)
as a test instead of length(x) > 0
. Some functions refuse to treat numeric as logical, but this is usually noted in their docs (for example, ?which
).
A couple additional symbols, ||
and &&
, do short circuiting and will be explained in 2.3.4.
For a logical vector, any
and all
summarise it in the natural way.
Most operations in R are performed elementwise:
x = c(1, 2)
y = c(2, 3)
x / y
# [1] 0.5000000 0.6666667
x ^ y
# [1] 1 8
We call functions with this behavior “vectorized.” R is a lot faster when operations are performed elementwise rather than in a loop, so there’s a common mantra to “vectorize your code,” especially when doing arithmetic. The idea is similar to translating a statistical model into matrix algebra.
c(-2,2) ^ c(1,2,3,4)
working, with no warning given. Fortunately, there is a warning if the recycling is imperfect, like c(-2,2) ^ c(1,2,3)
.
A couple of notably non-vectorized functions are && and ||. These only work with a single value on each side: older versions of R silently looked at just the first element, while R 4.3.0 and later raise an error when given a longer vector. If the result can be determined using the left-hand side, for example in FALSE && yodel-e-hee-hoo, then the right-hand side will not be evaluated. These are useful mostly for improving efficiency on the margins.
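As a sketch of short circuiting, the right-hand sides below would raise an error if they were ever evaluated, yet both lines run cleanly. The result names are illustrative.

```r
# The right-hand side is skipped once the left-hand side settles the result
res_and = FALSE && stop("not reached")
res_or = TRUE || stop("not reached")

# The elementwise operators, by contrast, return one result per element
res_vec = c(TRUE, FALSE) & c(TRUE, TRUE)
```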
t(X)
det(X)
X %*% Y
Transposition and multiplication treat a vanilla vector as a column vector.
The undecorated *
will always work elementwise (2.3.4) and silently recycle values:
# example data
X = matrix(c(0, 0, 1, 1), 2)
Y = matrix(c(2, 3, 4, 5), 2)
# elementwise multiplication
X * Y
# [,1] [,2]
# [1,] 0 4
# [2,] 0 5
# matrix multiplication
X %*% Y
# [,1] [,2]
# [1,] 3 5
# [2,] 3 5
The order of operations is listed in ?Syntax
. In addition to the operators listed, **
is an alias for ^
. The %any%
listed there refers to a broad class of “infix” binary operators:
%/% and %% for integer division and modulo;
%in% to test membership (see 4.1); and
%*% for matrix multiplication.
Even knowing the order of operations, I recommend using parentheses liberally, particularly in !(...)
statements.
x = 1:4, use the functions introduced so far to compute how many elements are greater than 1.
?log10 and use it to round x = 40 up to the next order of magnitude, like 10^(_fill_this_in_).
x = c(NA, 1, 33), filter to elements greater than 10 (and not missing), like x[_fill_this_in_].
?rev and from x = 1:10, extract the last three elements using head and rev. What is a more direct way?
NA | 1 evaluate to TRUE?
?abs and figure out why abs(NA) + 1 > 0 evaluates to NA. Should it?
Functions in R all map input to output. In the documentation for a function, the input is covered in the Arguments section, while the output is in the Value section.
R has a variety of packages specialized by task and area of research. The Task Views on the web are a good way to browse available packages on CRAN, the Comprehensive R Archive Network.
Besides adding additional functions, many packages add new classes, for example for time series data.
It is best to install packages from CRAN or another major repository, rather than personal sites. Packages submitted to CRAN (i) have to pass some vetting procedures defined by the developers of R and (ii) are typically used broadly, improving exposure of bugs.
To install a package from CRAN, use install.packages
:
install.packages("fortunes")
The first time this is run during an R session, a prompt will pop up, asking which mirror to download from.
After installing a package, a direct attempt at using its functions will fail:
fortune(111)
# Error in eval(expr, envir, enclos): could not find function "fortune"
Instead:
fortunes::fortune(111)
#
# You can't expect statistical procedures to rescue you from poor data.
# -- Berton Gunter (on dealing with missing values in a cluster analysis)
# R-help (April 2005)
This code works because it indicates the “namespace” that the function comes from. See ?`::`
.
Another way to use functions from a package is to attach it with library
:
library(fortunes)
fortune(347)
#
# This is a bit like asking how should I tweak my sailboat so I can explore the ocean floor.
# -- Roger Koenker (in response to a question about tweaking the quantreg package to
# handle probit and heckit models)
# R-help (May 2013)
This adds all of the package’s functions and other objects to the search path, so we can write fortune(111)
instead of the more verbose fortunes::fortune(111)
. The downside to attaching a package is the risk of “namespace conflicts”:
Fortunately, in all these cases, R will print a message about the conflicting names.
To remove a package from the namespace, there’s detach
, but I always just restart my R session instead.
To read the doc for a function in a package, just write the full path, like ?fortunes::fortune
. To see a listing of functions provided by a package, use help(package="fortunes")
or click “Index” at the bottom of the help page for any function in the package.
To update a package, just install it again:
install.packages("fortunes")
If a package is loaded (or attached), it cannot be updated, so close all R sessions where the package is loaded before updating it.
Type sessionInfo()
to see which packages are loaded (along with additional information on the current R session).
installed.packages
to view dependencies and version numbers. Some other tools, rocker and drat, look promising.
My opinion on trusting packages. If you don’t understand a package’s docs, don’t use it for anything serious. For a stats package, look for a published journal paper introducing it (often in the R Journal or the Journal of Statistical Software).
If the version number of the package is below 1.0, be ready to keep track of its development (since most package developers follow a pattern like Hadley Wickham’s). Packages for input, output or graphing are fragile in any language, so isolate the code that depends on them as much as possible.
We can see the arguments to the vector function by typing ?vector
or using args
:
args(vector)
# function (mode = "logical", length = 0L)
# NULL
We see that the order of arguments here contrasts with how it was used in 2.1.11:
x = vector(length = 5, mode = "list")
The arguments are being passed out-of-order, but it still works since the arguments are being passed by name. Passing arguments by the position, like sort(x, TRUE)
, is a riskier proposition, since it is harder to get right initially and to interpret correctly later. Data.table idioms I follow for merging (3.5) all involve passing arguments out-of-order.
To apply the same function to every element of a list, use lapply
.
lapply(list(c(3,1,4), c(2,7,9), 0), min)
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# [[3]]
# [1] 0
lapply
is a very important tool in R and will come up again and again.
When the function’s return value is guaranteed to be a scalar, sapply
or vapply
may be a better choice, but I almost never need those in practice.
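For comparison, here is a sketch of the same computation with sapply and vapply, which return a plain vector instead of a list. With vapply, the FUN.VALUE argument declares the expected type and length of each result, so a surprise raises an error rather than passing silently. The names L, mins_s and mins_v are illustrative.

```r
L = list(c(3, 1, 4), c(2, 7, 9), 0)

mins_s = sapply(L, min)                           # simplifies when it can
mins_v = vapply(L, min, FUN.VALUE = numeric(1))   # insists on one number each

mins_s
mins_v
```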
Like any other object, functions can be assigned with <-
or =
:
f = function(x) x^2 + 3
f(4)
Or steal a function from elsewhere:
f = fortunes::fortune
For more complicated functions, use {}
:
f = function(x){
x[1] = 999
return(x)
}
return
provides the return value. If a return
command is reached before the end of a function, the rest of the code is not run. If no return
command is given, the last value is used. To return multiple objects, combine them in a list and return it instead.
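Returning several objects at once might be sketched like this. The function range_info and its element names are hypothetical, chosen only for illustration.

```r
# Return several results by bundling them in a named list
range_info = function(x){
  lo = min(x)
  hi = max(x)
  return(list(lo = lo, hi = hi, width = hi - lo))
}

res = range_info(c(3, 1, 4))
res[["lo"]]
res[["width"]]
```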
As with most other objects, R allows function names to be reused without warning, like
min = max
Of course, this is usually a bad idea. To eliminate the custom function and revert to the base value, use rm
:
rm(min)
Consider a function that modifies its input:
# example function
f = function(x){
x[1] = 999
return(x)
}
Does modification inside the function alter the object passed to it?
z = c(1,2,3)
f(z)
# [1] 999 2 3
z
# [1] 1 2 3
We see that the modification does not carry over to the input itself. Usually, nothing created or altered inside a function does, meaning there are no side effects. There are ways to write functions with side effects in base R (<<-
and assign
), but these are discouraged:
fortunes::fortune("the assign function")
#
# The only people who should use the assign function are those who fully understand why you
# should never use the assign function.
# -- Greg Snow
# R-help (July 2009)
The data.table package, introduced in 3.1.2, has a different design philosophy regarding the mutability of objects by functions. Its core functions almost all hinge on side effects.
Now let’s see how global variables called inside a function are handled.
y = 1
g = function(x) x + y
y = 2
g(1)
# [1] 3
y = 3
g(1)
# [1] 4
So R only looks for y
when it needs it and no sooner, that is, when g is called rather than when it is defined. This is called “lazy evaluation.” It has advantages, but can lead to mistakes. The next two sections (2.4.7 and 2.4.8) cover ways of managing lazy evaluation.
One can use with
to bind a value to the function:
g2 = with(list(y = 4), function(x) x + y)
y = 1
g2(1)
# [1] 5
Now, the y
the function uses is part of the “environment” of the function. Environments are essentially lists, so the value can be extracted following syntax from 2.1.8:
environment(g2)[["y"]]
# [1] 4
They can also be assigned to:
environment(g2)[["y"]] <- 10
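To see the effect of such an assignment, this sketch rebuilds g2 and calls it before and after changing the bound value. The names before and after are illustrative.

```r
g2 = with(list(y = 4), function(x) x + y)

before = g2(1)                     # uses the bound value y = 4
environment(g2)[["y"]] <- 10       # change the value stored with the function
after = g2(1)                      # now uses y = 10

before
after
```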
The documentation shows default values for arguments. After reading ?vector
, we know that vector()
will return the same thing as vector(mode = "logical", length = 0)
.
When writing our own functions, we can also use default values. This offers another way of getting around the issue in 2.4.6:
g3 = function(x, y = 4) x + y
g3(1)
# [1] 5
g3(1, y = 10)
# [1] 11
Here, if we don’t pass y
as an argument, the default value prevails. Thanks to lazy evaluation, we can even define default values in terms of other arguments:
g4 = function(x, y = x) x + y
g4(2)
# [1] 4
Of course, this will still break if the definitions are circular.
When I run into a function and don’t know what package it is from, besides searching online, I also sometimes use help.search
or its ??
shortcut, like ??fortunes
. This will search all installed packages.
To inspect the R code behind a function, as with any other R object, just type its name:
nrow
# function (x)
# dim(x)[1L]
# <bytecode: 0x0000000019dc6698>
# <environment: namespace:base>
replace
# function (x, list, values)
# {
# x[list] <- values
# x
# }
# <bytecode: 0x000000001db652d8>
# <environment: namespace:base>
The last two rows of these functions show (i) that they are compiled and (ii) the environment they were defined in.
Many functions are not “exported” for direct use. Nonetheless, their code may be found using getAnywhere
or accessed directly using :::
syntax:
getAnywhere(head.default)
utils:::head.default
Many such hidden functions are “methods”; see ?Methods
. There are also several functions that should be hidden, documented at ?.subset
, ?.getNamespace
and elsewhere. These docs always have prominent notes like:
Not intended to be called directly, and only visible because of the special nature of the base namespace.
Heed the docs and don’t use any hidden or should-be-hidden functions directly.
Other functions indicate that their core functionality is written elsewhere:
max
# function (..., na.rm = FALSE) .Primitive("max")
Often, the source code for base R functions is in C and is fairly readable even for the C-illiterate. Following Joshua Ulrich’s instructions:
/src/main/names.c
c-entry column for the function of interest.
/src/main/*.c for it.
ndim, a function to count the number of dimensions an array has.
?cumsum, then write a cummean function to compute the cumulative mean.
library(package = "_your_new_package_") to find several objects included with the package.
?pmin, then write a function that takes three vectors and computes \(\min((x-y)^+, z)\) elementwise, where \((\cdot)^+\) denotes the positive part, testing on x = c(10,20,30); y = 15; z = c(25,15,5), for which the correct result is c(0, 5, 5).
As in most programming languages, we have for
loops, if
/else
blocks, and so on. Some quirks to note:
for (i in idx) ... to iterate over the vector idx, even its duplicates.
else if, since there is no elseif, elsif or elif.
next to bust out of the current iteration of a loop.
break to bust out of the loop entirely.
for loops
Suppose we have this loop to compute x[i] from idx[i]:
idx = c(1, 2, 3, 4, 5)
x = numeric(length(idx))
for (i in idx){
i2 = i^2
x[i] = 2*i2 - 1
}
There are a couple problems here:
x multiple times, requiring extra coding if we want to reuse this code elsewhere.
i and i2 in the end.
x[i] can be computed independently.
A sibling to lapply, our tool for iterating over lists from 2.4.3, can help:
idx = c(1, 2, 3, 4, 5)
x = sapply(idx, function(i){
i2 = i^2
return(2*i2 - 1)
})
Of course, there is still one further problem:
idx = c(1, 2, 3, 4, 5)
i2 = idx^2
x = 2*i2 - 1
Vectorized code and matrix algebra are much faster than a loop. Loops should be reserved for tasks that really must be performed iteratively. When speed becomes an issue for such tasks, it is time to learn the Rcpp family of packages, which allow for coding in C++.
ESC
. This applies elsewhere in R as well, but I usually run into it with loops or one of the several functions documented at ?lapply
.
if/else elementwise
x = if (cond) 1 else 0, don’t. As mentioned in 2.3.2, the logical variable cond
itself is perfect for storing this dummy variable.
The base R tool for handling elementwise if
/else
rules is the function ifelse
. However, it has some serious shortcomings. A better way, similar to the gen
/replace
pattern in Stata, is covered in 3.4.4.
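For reference, here is a sketch of basic ifelse usage along with one well-known shortcoming: it drops attributes such as the Date class from its result. The variable names and the example date are made up for illustration.

```r
x = c(-2, 0, 3)

# Elementwise rule: one label per element of x
sign_label = ifelse(x > 0, "positive", "not positive")
sign_label

# Shortcoming: the Date class is lost, leaving a bare number
d = as.Date("2020-01-01")
lost_class = ifelse(TRUE, d, d)
class(lost_class)
```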
n
is how long the vectors are. For x = 1:4; y = c(6,1,4,1)
it should return 3.
My recommendation is always to use a data.table (introduced in 3.2) instead of a data frame, so I’ve kept this section short.
Vectors and matrices are fine once we’re estimating a statistical model. However, we typically start with data consisting of many observations of multiple variables of varying types. To store data of different types, a vector or matrix won’t work; we need a list. R offers a special list structure, the data frame, for this common situation:
# example data
DF = data.frame(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
DF
# x y z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4 TRUE
# 5 e 5 TRUE
Data frames are printed with observations on rows and variables on columns. In many ways, they behave like matrices: working with functions like dim
, ncol
, nrow
, head
and tail
; using DF[i, j]
syntax for slices; and even using DF[im]
syntax for matrix indexing.
DF[3, ]
# x y z
# 3 c 3 FALSE
DF[, "z", drop=FALSE]
# z
# 1 FALSE
# 2 FALSE
# 3 FALSE
# 4 TRUE
# 5 TRUE
im = matrix(c(1,1,2,2), 2, 2, byrow=TRUE)
DF[im] # DON'T DO THIS
# [1] "a" "2"
The last item above, DF[im]
is nonsensical in two ways:
Naturally, if the data already is entirely numeric and the remaining analytical tasks are all linear algebra, a matrix will be simpler and almost always computationally faster than a data frame.
It is useful to keep in mind that data frames are really lists of columns, not matrices. We can use lapply
(2.4.3) to iterate over such lists:
lapply(DF, function(x) x[3])
# $x
# [1] c
# Levels: a b c d e
#
# $y
# [1] 3
#
# $z
# [1] FALSE
Here, we are extracting the third element from each column.
lapply
call by coercion to a list: as.list(DF[3, ])
. The names of similar functions can be guessed: as.matrix
, as.data.frame
, as.vector
, as.logical
, as.integer
, as.numeric
, as.character
, as.factor
, and so on. There is as(x, "class_name")
, but I have found the latter to be finicky.
To browse a data frame in a separate window, use View(DF)
.
R includes several handy constants, like pi
, the alphabet (?letters
), months (?month.name
), days of the week (?weekdays
) and US state names and abbreviations (?state
). There is no need to worry about accidentally reusing any of these constants’ names for variables (though redefining pi
might not be a great idea).
matrix
named “matrix.” One common mistake is writing scripts with T
and F
(built-in abbreviations for TRUE
and FALSE
), which is a bad idea because it is easy to accidentally break such code by overwriting, for example, T
as the number of periods in a panel and F
as some cumulative distribution function.
On a related note, ?Reserved
lists all of R’s important built-ins that cannot be overwritten.