Chapter 2 Basics

This chapter covers R’s core data structures and syntax.

2.1 Vectors

In R, all data come in vectors. If we enter x = 54.7, then x will be a numeric vector of length 1. Similarly, y = c("Yah", "Bada bing") is a character vector of length 2. Type a vector’s name to see it, and use length and class to see those attributes.

# example data
x = c(TRUE, FALSE, TRUE)
x

# [1]  TRUE FALSE  TRUE

length(x)

# [1] 3

class(x)

# [1] "logical"

The bracketed [1] indicates that the line of output starts with the first element of the vector. This printing pattern is handy with vectors too long for a single line, like c(LETTERS, letters). It is also a useful reminder that vectors are indexed starting from 1 (instead of 0 as in C, Python, et al.).

Assign with = or <-. Or even like c(2,3) -> z, though this is rare and hard to read in my opinion. If an existing vector is overwritten, no warning is given.
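As a small sketch of the assignment forms just described (the names a, b and z are only illustrative), all three create a vector, and reassigning an existing name replaces it silently:

```r
# three equivalent ways to assign
a = c(2, 3)
b <- c(2, 3)
c(2, 3) -> z

identical(a, b) && identical(b, z)
# [1] TRUE

# overwriting an existing vector gives no warning
a = "now a character vector"
class(a)
# [1] "character"
```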

2.1.1 Documentation

Try typing help.start(). This will open a page of links to manuals including “An Introduction to R”, “The R Language Definition” and the “R FAQ” (under “Resources”). These are also available in pdf online.

Everything is documented in convenient wikified html pages, accessible with ? or help. For example, ?TRUE, ?LETTERS or ?length. Backticks are sometimes required, as for ?`for` and ?`?`. Keep a close eye on the Description and Value sections, which concisely explain what to expect from a function. The Examples section always usefully illustrates functionality; it can be run in the console like example("LETTERS"). Additional illustrations are available using demo; type demo() for a listing.

R doesn’t do error codes, only error messages and warnings. If confused by a message, a search online or a review of the relevant docs is usually sufficient.

2.1.2 Classes

Every vanilla vector has a single class. That is, every element of the vector contains the same type of data:

c(4, c("A", "B"))

# [1] "4" "A" "B"

Coercion. The 4 is “coerced” to character above, with no warning. Since coercion is so central to R and confusing for new users, I strongly suggest reading its documentation, in the Details section of ?c.

The doc at ?c also conveniently lists all of the vanilla (“atomic”) classes. The key ones are

Logical. Use TRUE or FALSE.
Integer. Write L at the end: 1L, 2L, etc. The L will not be displayed in output.
Numeric. Write it as usual or with e notation, 2, 3.4, 3e5.
Character. Use single or double quotes, "bah", 'I say "gah"', etc.

A list is a special type of vector whose elements can be arbitrary objects, for example, a list of vectors:

# example data
L  = list("A", c(1,2))
L2 = list(FALSE)
L3 = c(L, L2)

L3

# [[1]]
# [1] "A"
# 
# [[2]]
# [1] 1 2
# 
# [[3]]
# [1] FALSE

The [[1]] printed here indicates the first element of the list, similar to the [1] we see for atomic vectors.

Fancier classes are generally built on top of atomic classes or lists. Behind the scenes, date formats are numeric or integer; while complex objects like data sets and regression results are lists. For details on storage modes, see ?typeof and the R internals documentation. While some fancy data structures like linked lists and unordered sets are absent in base R, they can be used through packages like Rcpp.

For lists, in addition to a length, we also have lengths, measuring each element:

lengths(L3)

# [1] 1 2 1

The class of an object can be tested with is.logical, etc.
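For instance, with a few made-up vectors (the names are arbitrary), the is.* family of tests behaves as follows. Note that is.numeric is true for integer vectors as well:

```r
# illustrative vectors
v_lgl = c(TRUE, NA)
v_int = 1:3
v_num = c(1.5, 2)

is.logical(v_lgl)
# [1] TRUE
is.integer(v_int)
# [1] TRUE
is.integer(v_num)
# [1] FALSE

# is.numeric is TRUE for both integer and double vectors
is.numeric(v_int)
# [1] TRUE
is.list(list(1, "a"))
# [1] TRUE
```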

2.1.3 Making comparisons

To compare two objects, use == and !=. R will silently coerce the objects’ classes to match:

"4.11" == 4.11

# [1] TRUE

In addition, there are the usual inequality operators (>, <, >=, <=), which also apply to strings, using lexicographic ordering:

"A" >= "B"

# [1] FALSE

2.1.4 Inspecting objects

For complicated objects, like L3 above, class(L3) is not very informative. Examining the “structure” of the object is usually more useful:

str(L3)

# List of 3
#  $ : chr "A"
#  $ : num [1:2] 1 2
#  $ : logi FALSE

Sometimes, we need closer inspection for debugging. The functions unclass (which removes the class) and dput (which prints R code to create the object) are helpful for this. See 3.6.4.5 for an example relevant to date and time objects.

To explore the set of loaded objects, see the tools in 3.7.1.

2.1.5 Named elements

A vector’s elements can be named:

c(a = 1, b = 2)

# a b 
# 1 2

list(A = c(1, 2), B = 4)

# $A
# [1] 1 2
# 
# $B
# [1] 4

The names of an object’s elements can be accessed with the names function.

Passing arguments. Inside c(A = 1), the equals sign is not interchangeable with <-, so don’t write c(A <- 1). Think of c as a function with syntax like c(argname = argvalue, argname2 = argvalue2, ...). The equals sign has the special role of giving names to the function’s arguments. See 2.4.2 for details.

To assign names programmatically…

nms = c("a", "b")
c(nms[1] = 1, nms[2] = 2) # error

# Error: <text>:2:10: unexpected '='
# 1: nms = c("a", "b")
# 2: c(nms[1] =
#             ^

setNames(c(1, 2), nms)    # use this instead

# a b 
# 1 2

2.1.6 Missing values

R differs from other programming languages in its careful treatment of missing values in the reading and processing of data. The missing-data code NA is distinct from other special codes:

a non-number, NaN;
positive and negative infinity, Inf and -Inf; and
the absence of an object, NULL.

A missing value means “there is a value here, but we don’t know what it is.” So a comparison against a missing value, like 2 == NA or 2 != NA, will always return NA, since we cannot determine whether the condition is true or false. To test whether a value is missing, use the is.na function.

Implementation of missing values. The class of an NA is different depending on the vector it is in. Sometimes, it is important to ensure that a missing value belongs to a particular class. To ensure this, the safest approach is to “slice” an object of that class with NA_integer_. For example, to get a missing numeric value, we can use the number 10 like 10[NA_integer_]; or for a date value, we can use the current date, Sys.Date(), like Sys.Date()[NA_integer_].
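A short sketch of these points, using a made-up vector x: comparisons against NA return NA, is.na is the proper test, and slicing with NA_integer_ gives a missing value of the same class as the object being sliced.

```r
x = c(2, NA, 5)

x == NA        # comparison with NA never says TRUE or FALSE
# [1] NA NA NA

is.na(x)       # the right way to test for missingness
# [1] FALSE  TRUE FALSE

# NA takes on the class of the vector it lives in
class(NA)
# [1] "logical"
class(x[2])
# [1] "numeric"

# slicing with NA_integer_ yields a missing value of the same class
class(10[NA_integer_])
# [1] "numeric"
class(Sys.Date()[NA_integer_])
# [1] "Date"
```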

2.1.7 Slicing to a subvector

Select a subvector of x like x[i]:

# example data
x = c("a", "b", "c", "d")
L = list(X = 1, Y = 2, Z = 3)

x[c(1, 3)]

# [1] "a" "c"

x[-c(2, 3)]

# [1] "a" "d"

L[c("Y","Z")]

# $Y
# [1] 2
# 
# $Z
# [1] 3

x[c(FALSE, TRUE, TRUE, FALSE)]

# [1] "b" "c"

So, the index i can be…

a vector of positions to include;
a negated vector of positions to exclude;
a vector of element names; or
a true/false vector, with a selection rule for each position.

i can even contain repeated elements, like x[c(1,1,2)], which is some more general “slice” of x, not really a subvector.

Recycling. What would be the natural behavior of x[TRUE] in the example above? Well, in R, it acts like x[c(TRUE, TRUE, TRUE, TRUE)], without giving any warning. This is called recycling – the argument i is repeated until it is long enough for the context it is used in. What about x[c(TRUE, FALSE)]? The full rules for recycling can be found in the R intro doc.

Special slices. There are a few edge cases worth noting. As mentioned in 2.1.6, it can be useful to set i as NA_integer_ to find the “missing value version of x.” Similarly, we can set i to 0L to find the “empty version of x.” Setting i to include positions or names outside of range will just yield missing values and is typically not useful.

Convenience functions head and tail enable easy slicing from the start or end of a vector.
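The recycling rule, the special slices and the convenience functions can all be tried on the same example vector. This is only a sketch of the cases mentioned above:

```r
x = c("a", "b", "c", "d")

x[c(TRUE, FALSE)]   # recycled to c(TRUE, FALSE, TRUE, FALSE)
# [1] "a" "c"

x[NA_integer_]      # the "missing value version" of x
# [1] NA
x[0L]               # the "empty version" of x
# character(0)
x[6]                # out of range: a missing value, no error
# [1] NA

head(x, 2)
# [1] "a" "b"
tail(x, 2)
# [1] "c" "d"
head(x, -1)         # a negative n drops elements from the other end
# [1] "a" "b" "c"
```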

2.1.8 Extracting from a list

Grab a single element of a vector like x[[i]], where i is a single name or number:

# example data
L = list(X = 1, Y = 2, Z = 3)

L["Z"] # slicing

# $Z
# [1] 3

L[["Z"]] # extracting

# [1] 3

The slice L["Z"] is a list containing just one element; while L[["Z"]] actually is the element.

L$Z offers a handy alternative way of extracting by name from a list, but it is tougher to write a program around.
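One reason $ is harder to program with: when the element name is stored in a variable (here the illustrative nm), [[ uses the variable's value, while $ takes what is typed literally. A minimal sketch:

```r
L = list(X = 1, Y = 2, Z = 3)

L$Z
# [1] 3

# with the name stored in a variable, [[ works
nm = "Z"
L[[nm]]
# [1] 3

# ... while $ looks for an element literally named "nm"
L$nm
# NULL
```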

2.1.9 Assigning to a subvector

Assign to a subvector, or “subassign,” like x[i] = y or x[i] <- y:

# example data
x = c("a", "b", "c", "d")
L = list(X = 1, Y = 2, Z = 3)

x[c(2,4)] <- c("bada", "bing")
L[c("Y", "Z")] = list(21, 31)

The allowed values of i are the same as we saw for slicing a vector in 2.1.7. This is fairly unsafe on account of silent coercion, recycling and failure (when i is not a valid index of x). We will revisit how to modify vectors in data.tables in 3.4.

Hard-to-find documentation. For documentation on slicing and extracting, type ?`[`. Even though there is a ] for every [, it is the opening bracket that is the function’s actual name, which will come back in 2.4.3. For documentation on sub-assignment, type ?`[<-`.

Assigning to an empty index, x[] <- "blargh", will overwrite all elements. Assigning to indices beyond the vector’s bounds, x[11] <- "huzzah", will silently extend the vector to accommodate (which is generally quite inefficient).
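Both behaviors can be checked on a small vector. In this sketch, the out-of-bounds position is 6 rather than 11, just to keep the result short:

```r
x = c("a", "b", "c", "d")

x[] <- "blargh"       # every element overwritten; length unchanged
x
# [1] "blargh" "blargh" "blargh" "blargh"

x[6] <- "huzzah"      # extends x; the skipped position 5 becomes NA
length(x)
# [1] 6
is.na(x[5])
# [1] TRUE
```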

2.1.10 Assigning to a list element

One can similarly assign to a single element of a list with x[[i]] <- y. If an existing element is overwritten, no warning is given.
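A brief sketch with a made-up list follows. One extra behavior, not covered above but worth knowing, is that assigning NULL removes the element rather than storing a NULL:

```r
L = list(X = 1, Y = 2)

L[["Y"]] <- c("new", "value")   # overwrites Y silently, changing its class
L[["Z"]] <- 3                   # a new name appends an element
str(L)
# List of 3
#  $ X: num 1
#  $ Y: chr [1:2] "new" "value"
#  $ Z: num 3

# caution: assigning NULL removes the element instead of storing NULL
L[["X"]] <- NULL
names(L)
# [1] "Y" "Z"
```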

2.1.11 Initializing

In R, it is bad practice to dynamically grow objects, like x = 1, x = c(x,2), …, x = c(x,n). Instead, create the vector with its final length first:

x = vector(length = 5, mode = "list")
x = character(length = 5)
x = logical(length = 5)
x = numeric(length = 5)
x = integer(length = 5)

From there, we can subassign iteratively if necessary: x[1] <- "A", x[2] <- "B", and so on.
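As a sketch of this pattern (it borrows a for loop and seq_len, which are only introduced in 2.5 and 2.1.11.2; the name squares is illustrative), we allocate once and then fill in place:

```r
n = 5

# allocate once, then fill by subassignment
squares = numeric(length = n)
for (i in seq_len(n)) squares[i] <- i^2
squares
# [1]  1  4  9 16 25

# freshly initialized vectors hold default values
logical(2)
# [1] FALSE FALSE
character(2)
# [1] "" ""
```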

2.1.11.1 Repeating values

It’s also common to initialize a vector with a particular value using rep:

rep(3, 10)

#  [1] 3 3 3 3 3 3 3 3 3 3

rep(x, n) extends nicely to the case where x is a vector. It does this in a few different configurations, as documented at ?rep:

rep(c(3, 4), 5)

#  [1] 3 4 3 4 3 4 3 4 3 4

rep(c(3, 4), each = 5)

#  [1] 3 3 3 3 3 4 4 4 4 4

rep(c(3, 4), length.out = 5)

# [1] 3 4 3 4 3

rep(c(3, 4), c(3, 4))

# [1] 3 3 3 4 4 4 4

2.1.11.2 Integer sequences

Use a single colon to build a sequence of integers:

3:5

# [1] 3 4 5

Use parentheses liberally.

n:n+10 does not run from n to n+10. Parentheses are needed: n:(n+10).
-1:3+10 and 10-1:3 do not give the same results. Again, parentheses are needed to clarify which result is wanted.

See 2.3.6 for more details.

If we are running 1..n where n is a nonnegative integer variable, it is safer and more efficient to use seq_len than 1:n. It is safer in the sense that, if n is zero, seq_len(n) gives the correct result of a zero-length vector, while 1:0 gives c(1L, 0L).
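The zero-length case is easy to verify. The loop below is only there to illustrate the consequence (for loops are covered in 2.5, and count is a made-up name):

```r
n = 0

1:n            # counts down from 1 to 0
# [1] 1 0
seq_len(n)     # correctly empty
# integer(0)

# a loop over 1:n would run twice here; over seq_len(n), not at all
count = 0
for (i in seq_len(n)) count = count + 1
count
# [1] 0
```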

If we are running 1..length(x) alongside some vector x, then seq_along is the right tool:

x = c(3, 3, 4)
seq_along(x)

# [1] 1 2 3

R function names. It is not worth trying to make sense of R’s naming conventions. Presumably thanks to historical accidents (some involving R’s ancestor, S), we have names like seq.int, seq_along, setNames, Sys.time, Sys.Date, sys.status and system.time. All names are case-sensitive.

2.1.11.3 Fancy sequences

From here, the variety of options starts to resemble what we saw for the rep function in 2.1.11.1.

The function seq.int extends the colon operator:

seq.int(5, 10)

# [1]  5  6  7  8  9 10

seq.int(5, 10, by = 2)

# [1] 5 7 9

seq.int(5, by = 3, length.out = 3)

# [1]  5  8 11

seq.int(to = 100, by = -11, length.out = 3)

# [1] 122 111 100

seq.int(5, 10, by=.5)

#  [1]  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

The seq function does essentially the same thing; see the docs if interested in details. Other vector classes have their own seq methods. For example, seq.Date will allow for a sequence of dates.

2.1.12 Factors

As with missing values, a lot of thought was put into the treatment of categorical data in R, captured by the factor data type:

factor(c("New York", "Tokyo", "Mumbai", "Tokyo"))

# [1] New York Tokyo    Mumbai   Tokyo   
# Levels: Mumbai New York Tokyo

Factors often have no sense of ordering (like “Tokyo” naturally coming before “Mumbai”), but one can be added if appropriate:

factor(c("Worst", "Best", "Not Bad", "Worst"), levels = c("Worst", "Not Bad", "Best"), ordered = TRUE)

# [1] Worst   Best    Not Bad Worst  
# Levels: Worst < Not Bad < Best

To see whether a factor has an order, use is.ordered. Details on manipulating factors can be found at ?factor and linked pages.
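To make the distinction concrete, here is a small sketch that reuses the two examples above (the names f and r are illustrative). It also shows the integer codes stored behind the labels, in line with 2.1.2:

```r
f = factor(c("New York", "Tokyo", "Mumbai", "Tokyo"))

levels(f)          # sorted alphabetically by default
# [1] "Mumbai"   "New York" "Tokyo"
as.integer(f)      # the integer codes stored behind the labels
# [1] 2 3 1 3
is.ordered(f)
# [1] FALSE

r = factor(c("Worst", "Best", "Not Bad"),
           levels = c("Worst", "Not Bad", "Best"), ordered = TRUE)
is.ordered(r)
# [1] TRUE
r < "Best"         # comparisons follow the level order
# [1]  TRUE FALSE  TRUE
```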

2.1.13 Sorting, rank, order

There are three key functions here:

sort(x) will sort the vector.
rank(x) tells where each element of x is in the pecking order.
order(x) is rarely useful on its own, but y[order(x, z)] will sort y by x and z.

Be careful not to use these functions on an unordered factor, where they will make no sense but run without warning.

The frank() function from the data.table package is also useful, for its “dense rank” tie-breaking rule.
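The three base functions are easiest to tell apart side by side. A minimal sketch with made-up vectors x and y:

```r
x = c(30, 10, 20)
y = c("c", "a", "b")

sort(x)
# [1] 10 20 30
rank(x)        # 30 is the 3rd smallest, 10 the 1st, 20 the 2nd
# [1] 3 1 2
order(x)       # positions of x, from smallest to largest
# [1] 2 3 1
y[order(x)]    # y rearranged according to x
# [1] "a" "b" "c"

# x[order(x)] is the same as sort(x)
identical(x[order(x)], sort(x))
# [1] TRUE
```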

2.1.14 Exercises

What is the class of c(TRUE, 1)?
What is the length of list(1, c(2,3,4))?
If I run x = c("a","b") and then x[c(2,1)] = c("c","d"), what is the value of x[1]?
What is the command to open documentation for the assignment procedure used in L$X <- list(3)?

2.2 Matrices and arrays

The matrix function is used for construction:

# example data
v = c(1,2,3,4)

matrix(v, nrow = 2, ncol = 2)

#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

matrix(v, nrow = 2, ncol = 2, byrow = TRUE)

#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4

By default, the matrix is built with column-major order; but byrow = TRUE will read as row-major.

The dimensions of a matrix can be extracted with the dim function or nrow and ncol.

All elements of a matrix must have the same class, just like a vector. So we can have a “character matrix”, a “numeric matrix”, etc.

Vectors with attributes. In fact, a matrix is a vector, just augmented by the dim attribute. As mentioned at the top, all data in R is stored in vectors; and as mentioned in 2.1.2, most classes are built on top of vectors. One wrinkle is that is.vector is false for matrices. However, this makes sense in light of the doc, ?is.vector.
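The claim that a matrix is a vector plus a dim attribute can be checked directly. This sketch uses the dim<- assignment form, which is described later in 2.3.1:

```r
v = c(1, 2, 3, 4)
m = matrix(v, nrow = 2)

attributes(m)
# $dim
# [1] 2 2

length(m)         # still a vector of 4 elements underneath
# [1] 4
is.vector(m)      # FALSE, because m has an attribute other than names
# [1] FALSE

# adding a dim attribute to a plain vector gives the same matrix
dim(v) = c(2, 2)
identical(v, m)
# [1] TRUE
```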

2.2.1 Building matrices

Matrices can also be built from vectors with cbind and rbind (for “binding” columns or rows):

rbind(c(1,1), c(2,2))

#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2

cbind(c(1,1), c(2,2), c(3,3))

#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    3

These functions can also build matrices from other matrices.

diag(n) will make an identity matrix of size n; and diag(v) will make a diagonal matrix with v on the diagonal.
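A few quick illustrations of diag, and of binding a matrix together with a vector (values chosen arbitrarily):

```r
diag(2)
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1

diag(c(5, 6))
#      [,1] [,2]
# [1,]    5    0
# [2,]    0    6

# binding a matrix with a vector: the vector becomes a new row
rbind(diag(2), c(9, 9))
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]    9    9
```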

2.2.2 Named rows and columns

Rows and columns can be named:

matrix(c(1,2,3,4), nrow = 2, dimnames = list(c("a","b"), c("x","y")))

#   x y
# a 1 3
# b 2 4

2.2.3 Slicing to a submatrix

To select a submatrix (a “matrix slice”), use m[i,j] as in math:

# example data
m = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

#   x y z
# a 1 3 5
# b 2 4 6

m[1, c(1,2), drop = FALSE]

#   x y
# a 1 3

m["a", c("x","y"), drop = FALSE]

#   x y
# a 1 3

We can get fancy with indices, in all the ways described in 2.1.7. It is often convenient to leave one index blank, which means “select all”; try m[, 2, drop = FALSE].

Dropped dimensions. The drop = FALSE option is necessary to ensure that the result is a submatrix. R’s default behavior is to “drop” dimensions when it can, which makes for unpredictable output.

Implicitly, matrices are regarded as having observations as rows and variables as columns. As a result, head and tail will return rows from the top and bottom of a matrix, respectively.
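To see what dropping means in practice, compare the two results below on the same example matrix. The last call is a sketch of head applied to a matrix:

```r
m = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

m[, 2]                 # dimensions dropped: a named vector
# a b 
# 3 4
m[, 2, drop = FALSE]   # still a matrix, with 2 rows and 1 column
#   y
# a 3
# b 4

dim(m[, 2])
# NULL
dim(m[, 2, drop = FALSE])
# [1] 2 1

head(m, 1)             # first row, kept as a matrix
#   x y z
# a 1 3 5
```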

2.2.4 Extracting from matrices

Since matrices are vectors, we can extract a vector of values in the same way:

# example data
m = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

#   x y z
# a 1 3 5
# b 2 4 6

m[3]

# [1] 3

m[c(2,3)]

# [1] 2 3

Matrices also allow extraction like X[Y], where Y is a two-column “index matrix”:

im = matrix(c(1,1,2,2), 2, 2, byrow = TRUE)

#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2

m[im]

# [1] 1 4

The first column of im corresponds to rows, and the second to columns. Elements are extracted in the order they are listed in im.

drop = TRUE can be used to select a column or row, like m[, 2] (where drop = TRUE is the default) or a single element, like m[1, 2] or m["a", "y"].

Finally, diag(m) will extract the diagonal from a matrix; while upper.tri and lower.tri return true/false matrices marking those parts, so that m[upper.tri(m)] and m[lower.tri(m)] extract them (in column-major order).
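A sketch on a 3x3 matrix (chosen here because a square matrix makes the triangles easy to see) shows the diagonal and the triangle extractions. upper.tri(m) and lower.tri(m) are logical matrices used as the index:

```r
m = matrix(1:9, nrow = 3)
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

diag(m)
# [1] 1 5 9

m[upper.tri(m)]        # values above the diagonal, in column-major order
# [1] 4 7 8
m[lower.tri(m)]        # values below the diagonal
# [1] 2 3 6
```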

2.2.5 Subassigning to matrices

Assign to elements of a matrix in one of a few ways:

X[i,j] = z where i and j are rows and columns, respectively
X[i] = z with i being an index vector
X[im] = z with im being an index matrix

In all cases, z is a vector or matrix of values.
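The three forms can be sketched on a matrix of zeros (the values assigned are arbitrary):

```r
X = matrix(0, nrow = 2, ncol = 3)

X[1, c(1, 2)] = c(10, 20)       # by row and column
X[6] = 60                       # by position in column-major order

im = matrix(c(2, 1,
              2, 2), ncol = 2, byrow = TRUE)
X[im] = c(-1, -2)               # by index matrix: cells (2,1) and (2,2)

X
#      [,1] [,2] [,3]
# [1,]   10   20    0
# [2,]   -1   -2   60
```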

2.2.6 Arrays

Arrays are the same as matrices, just in higher dimensions. For example, a three-dimensional array will have a dim attribute of length three and have slicing syntax like a[i,j,k]. They can be constructed like a = array(v, dim = c(1,2,2)).
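For example, taking v to be c(1,2,3,4) as in the earlier matrix examples, the array fills in column-major order just like a matrix:

```r
v = c(1, 2, 3, 4)
a = array(v, dim = c(1, 2, 2))

dim(a)
# [1] 1 2 2
a[1, 2, 1]                      # row 1, column 2, first "layer"
# [1] 2
a[1, 2, 2]
# [1] 4
dim(a[, , 2, drop = FALSE])     # this slice keeps all three dimensions
# [1] 1 2 1
```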

2.2.7 Computing with matrices

Matrix algebra functions are covered in 2.3.5. Some other handy functions are:

colSums and rowSums, which take sums over columns and rows, respectively.
max.col which finds, for each row, which is the first column to achieve the per-row maximum.

The apply function is popular for applying arbitrary functions along dimensions of a matrix or array. However, it is very inefficient and usually discouraged. sweep is a similar function.
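A quick sketch of these functions on a small matrix. Note that max.col breaks ties at random by default, so ties.method = "first" is passed explicitly here to get the first-column behavior described above:

```r
m = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

colSums(m)
# [1]  3  7 11
rowSums(m)
# [1]  9 12
max.col(m, ties.method = "first")   # column 3 holds each row's maximum
# [1] 3 3

# apply gives the same row sums, through a more general mechanism
apply(m, 1, sum)
# [1]  9 12
```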

2.2.8 Exercises

From x = 1:6, build a 3x2 matrix rowwise like matrix(x, _fill_this_in_).
From m = matrix(1:6, 2), extract a submatrix containing the first column.
Read ?do.call and from L = list(c(1,1), c(2,2), c(3,3)), bind the elements of L as columns of a single matrix.
From m = matrix(1:4, 2), overwrite the upper left and lower right cells with 0.

2.3 Syntax

The parser interprets ; as the end of a command, so three commands x <- 10; y <- 5; x + y can be written on a single line. This is rarely useful when writing a program.

The {...} braces will return their last value, so {x <- 10; y <- 5; x + y} will return 15. It will also have the side effect of creating x and y. To avoid this side effect, use local, like

local({
  x <- 10
  y <- 5
  x + y
})

# [1] 15

This is also rarely needed since DT[i, {...}, by] syntax for data.tables (3) achieves the same result.

2.3.1 Assigning to attributes

There are convenience functions for modifying object attributes, with syntax like names(x) <- y or names(x) = y. A few are for…

vector element names, ?`names<-`
vector class, ?`class<-`
vector length, ?`length<-`
matrix or array dimensions, ?`dim<-`
matrix or array names, ?`dimnames<-`
matrix diagonal, ?`diag<-` (upper.tri and lower.tri have no such assignment form; subassign with them as an index instead, like m[upper.tri(m)] <- 0)

These are rarely needed and can be confusing for obvious reasons. Section 3.4 will discuss a different approach.
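For reference, a few of these assignment forms in action, on throwaway objects x and m:

```r
x = c(1, 2, 3)

names(x) <- c("a", "b", "c")
x
# a b c 
# 1 2 3

length(x) <- 2          # truncates x
x
# a b 
# 1 2

m = matrix(0, 2, 2)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2"))
diag(m) <- c(7, 8)
m
#    c1 c2
# r1  7  0
# r2  0  8
```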

2.3.2 Arithmetic

The full set of arithmetic operators is documented at ?Arithmetic. The integer division and modulo operators look like %/% and %%, but otherwise, everything is standard. Other basic functions include round, floor, ceiling, min, max, log, exp, sqrt.

Essentially all computations involving missing values will return a missing value. One exception is NA^0 since no matter what number the true value is, this expression will evaluate to 1. (Recall that a missing value means “there is some true value, but we don’t know what it is.”) The min and max functions can ignore missing values, as noted in their docs. 3.7.4 and 3.7.5 cover other summarizing functions, most of which similarly offer to ignore NAs.

As in any other programming environment, floating point arithmetic in R can trip up calculations. For example, try .1 + .05 - .15 == 0.

Arithmetic with logical values. Particularly when exploring data, we often want to tally observations meeting some condition. To do this in R, we can treat true/false like 1/0: is_gt = c(1,5,2) > c(1,2,3); sum(is_gt).
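The points in this section can be tried out directly. The use of all.equal for a tolerance-based comparison is an addition of this sketch, not something covered above; see ?all.equal.

```r
NA + 1
# [1] NA
NA^0                       # 1, whatever the missing number is
# [1] 1
max(c(3, NA, 7))
# [1] NA
max(c(3, NA, 7), na.rm = TRUE)
# [1] 7

.1 + .05 - .15 == 0        # floating point rounding
# [1] FALSE
isTRUE(all.equal(.1 + .05, .15))   # comparison with a tolerance
# [1] TRUE

is_gt = c(1, 5, 2) > c(1, 2, 3)
sum(is_gt)                 # number of TRUEs
# [1] 1
mean(is_gt)                # share of TRUEs
# [1] 0.3333333
```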

2.3.3 Logical operators

The self-explanatory symbols !, & and | are used to construct compound logical tests. The treatment of missing values is intuitive:

NA | TRUE is true, since no matter what the missing value is (TRUE or FALSE), the compound statement must be true.
NA & TRUE is NA, since we need to know the missing value to determine the truth or falsehood of the compound statement.
And so on.

Logic with numeric values. Numbers are treated as FALSE when zero and TRUE otherwise. This allows us to use length(x) as a test instead of length(x) > 0. Some functions refuse to treat numeric as logical, but this is usually noted in their docs (for example, ?which).

A couple additional symbols, || and &&, do short circuiting and will be explained in 2.3.4.

For a logical vector, any and all summarise it in the natural way.
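A few of these rules, sketched in code. The if/else at the end is covered in 2.5 and is used here only to show a number acting as a condition:

```r
NA | TRUE
# [1] TRUE
NA & TRUE
# [1] NA
NA & FALSE         # FALSE whatever the missing value is
# [1] FALSE

x = c(4, 8, 15)
any(x > 10)
# [1] TRUE
all(x > 10)
# [1] FALSE

# a number used as a condition: zero is FALSE, anything else TRUE
if (length(x)) "x has elements" else "x is empty"
# [1] "x has elements"
```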

2.3.4 Elementwise operations

Most operations in R are performed elementwise:

x = c(1, 2)
y = c(2, 3)
x / y

# [1] 0.5000000 0.6666667

x ^ y

# [1] 1 8

We call functions with this behavior “vectorized.” R is a lot faster when operations are performed elementwise rather than in a loop, so there’s a common mantra to “vectorize your code,” especially when doing arithmetic. The idea is similar to translating a statistical model into matrix algebra.

Recycling, again. As we saw in 2.1.7, recycling kicks in for any vectorized function. So we have c(-2,2) ^ c(1,2,3,4) working, with no warning given. Fortunately, there is a warning if the recycling is imperfect, like c(-2,2) ^ c(1,2,3).

A couple of notably non-vectorized functions are && and ||. These work with a single value on each side: older versions of R would (silently) only look at the first element on each side, while R 4.3.0 and later give an error for anything longer. If the result can be determined using the left-hand side, for example in FALSE && yodel-e-hee-hoo, then the right-hand side will not be evaluated. These are useful mostly for improving efficiency on the margins.
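Short-circuiting can be seen with a name that does not exist (the name below is hypothetical and deliberately left undefined):

```r
# the right-hand side is never evaluated here, so the undefined
# name (hypothetical, chosen for illustration) causes no error
FALSE && undefined_name_for_illustration
# [1] FALSE
TRUE || undefined_name_for_illustration
# [1] TRUE

# a typical use: guard a test that only makes sense for nonempty input
x = numeric(0)
length(x) > 0 && x[1] > 10
# [1] FALSE
```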

2.3.5 Matrix algebra

Transpose t(X)
Take determinant det(X)
Multiply X %*% Y

Transposition and multiplication treat a vanilla vector as a column vector.

The undecorated * will always work elementwise (2.3.4) and silently recycle values:

# example data
X = matrix(c(0, 0, 1, 1), 2)
Y = matrix(c(2, 3, 4, 5), 2)

# elementwise multiplication
X * Y

#      [,1] [,2]
# [1,]    0    4
# [2,]    0    5

# matrix multiplication
X %*% Y

#      [,1] [,2]
# [1,]    3    5
# [2,]    3    5

2.3.6 Order of operations

The order of operations is listed in ?Syntax. In addition to the operators listed, ** is an alias for ^. The %any% listed there refers to a broad class of “infix” binary operators:

%/% and %% for integer division and modulo;
%in% to test membership (see 4.1);
%*% for matrix multiplication; and
any custom infix operators defined by the user or in loaded packages.

Even knowing the order of operations, I recommend using parentheses liberally, particularly in !(...) statements.
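Some cases where precedence commonly surprises, along with a sketch of a user-defined infix operator (the operator name %+% is made up for this example):

```r
-2^2              # ^ binds tighter than unary minus
# [1] -4
(-2)^2
# [1] 4

n = 3
n:n+2             # read as (n:n) + 2
# [1] 5
n:(n+2)
# [1] 3 4 5

!TRUE & FALSE     # read as (!TRUE) & FALSE
# [1] FALSE
!(TRUE & FALSE)
# [1] TRUE

# a custom infix operator (hypothetical name)
`%+%` = function(a, b) paste(a, b)
"bada" %+% "bing"
# [1] "bada bing"
```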

2.3.7 Exercises

From x = 1:4, use the functions introduced so far to compute how many elements are greater than 1.
Read ?log10 and use it to round x = 40 up to the next order of magnitude, like 10^(_fill_this_in_).
From x = c(NA, 1, 33), filter to elements greater than 10 (and not missing), like x[_fill_this_in_].
Read ?rev and from x = 1:10, extract the last three elements using head and rev. What is a more direct way?
Why does NA | 1 evaluate to TRUE?
Read ?abs and figure out why abs(NA) + 1 > 0 evaluates to NA. Should it?

2.4 Functions

Functions in R all map input to output. In the documentation for a function, the input is covered in the Arguments section, while the output is in the Value section.

2.4.1 Community-made functions

2.4.1.1 Exploring packages

R has a variety of packages specialized by task and area of research. The Task Views on the web are a good way to browse available packages on CRAN, the Comprehensive R Archive Network.

Besides adding additional functions, many packages add new classes, for example for time series data.

2.4.1.2 Installing packages

It is best to install packages from CRAN or another major repository, rather than personal sites. Packages submitted to CRAN (i) have to pass some vetting procedures defined by the developers of R and (ii) are typically used broadly, improving exposure of bugs.

To install a package from CRAN, use install.packages:

install.packages("fortunes")

The first time this is run during an R session, a prompt will pop up, asking which mirror to download from.

2.4.1.3 Using functions from packages

After installing a package, a direct attempt at using its functions will fail:

fortune(111)

# Error in eval(expr, envir, enclos): could not find function "fortune"

Instead:

fortunes::fortune(111)

# 
# You can't expect statistical procedures to rescue you from poor data.
#    -- Berton Gunter (on dealing with missing values in a cluster analysis)
#       R-help (April 2005)

This code works because it indicates the “namespace” that the function comes from. See ?`::`.

Another way to use functions from a package is to attach it with library:

library(fortunes)
fortune(347)

# 
# This is a bit like asking how should I tweak my sailboat so I can explore the ocean floor.
#    -- Roger Koenker (in response to a question about tweaking the quantreg package to
#       handle probit and heckit models)
#       R-help (May 2013)

This adds all of the package’s functions and other objects to the search path, so we can write fortune(111) instead of the more verbose fortunes::fortune(111). The downside to attaching a package is the risk of “namespace conflicts”:

If the attached package has an object with the same name as something defined by the user, that object won’t enter the global namespace.
If two attached packages contain objects with the same name, the last-attached object gets precedence for entry into the global namespace.
Some packages will elbow their way past built-in base functions when attached, with a warning. When this happens, it should probably be taken as a red flag.

Fortunately, in all these cases, R will print a message about the conflicting names.

To remove a package from the namespace, there’s detach, but I always just restart my R session instead.

2.4.1.4 Reading package docs

To read the doc for a function in a package, just write the full path, like ?fortunes::fortune. To see a listing of functions provided by a package, use help(package="fortunes") or click “Index” at the bottom of the help page for any function in the package.

2.4.1.5 Updating packages

To update a package, just install it again:

install.packages("fortunes")

If a package is loaded (or attached), it cannot be updated, so close all R sessions where the package is loaded before updating it.

Type sessionInfo() to see which packages are loaded (along with additional information on the current R session).

Managing packages. Use installed.packages to view dependencies and version numbers. Some other tools, rocker and drat, look promising.

My opinion on trusting packages. If you don’t understand a package’s docs, don’t use it for anything serious. For a stats package, look for a published journal paper introducing it (often in the R Journal or the Journal of Statistical Software).

If the version number of the package is below 1.0, be ready to keep track of its development (since most package developers follow a pattern like Hadley Wickham’s). Packages for input, output or graphing are fragile in any language, so isolate the code that depends on them as much as possible.

2.4.2 Passing arguments

We can see the arguments to the vector function by typing ?vector or using args:

args(vector)

# function (mode = "logical", length = 0L) 
# NULL

We see that the order of arguments here contrasts with how it was used in 2.1.11:

x = vector(length = 5, mode = "list")

The arguments are being passed out-of-order, but it still works since the arguments are being passed by name. Passing arguments by the position, like sort(x, TRUE), is a riskier proposition, since it is harder to get right initially and to interpret correctly later. Data.table idioms I follow for merging (3.5) all involve passing arguments out-of-order.
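To compare the two styles on the sort example just mentioned:

```r
x = c(3, 1, 2)

# by position: the reader must know that the second argument is `decreasing`
sort(x, TRUE)
# [1] 3 2 1

# by name: self-documenting
sort(x, decreasing = TRUE)
# [1] 3 2 1

# named arguments can come in any order
identical(vector(length = 2, mode = "list"), vector("list", 2))
# [1] TRUE
```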

2.4.3 Iterating over a list

To apply the same function to every element of a list, use lapply.

lapply(list(c(3,1,4), c(2,7,9), 0), min)

# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2
# 
# [[3]]
# [1] 0

lapply is a very important tool in R and will come up again and again.

When the function’s return value is guaranteed to be a scalar, sapply or vapply may be a better choice, but I almost never need those in practice.
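For comparison, here is the same computation with sapply and vapply, as a brief sketch. vapply requires us to declare what each result should look like, here a single number:

```r
L = list(c(3, 1, 4), c(2, 7, 9), 0)

sapply(L, min)                           # simplified to a vector
# [1] 1 2 0

vapply(L, min, FUN.VALUE = numeric(1))   # same, but checks each result
# [1] 1 2 0
```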

2.4.4 Writing functions

Like any other object, functions can be assigned with <- or =:

f = function(x) x^2 + 3
f(4)

Or steal a function from elsewhere:

f = fortunes::fortune

For more complicated functions, use {}:

f = function(x){
  x[1] = 999
  return(x)
}

return provides the return value. If a return command is reached before the end of a function, the rest of the code is not run. If no return command is given, the last value is used. To return multiple objects, combine them in a list and return it instead.
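A sketch that combines an early return with a list of multiple outputs (the function name range_summary and its element names lo and hi are made up for illustration):

```r
# illustrative function: returns NULL early for empty input,
# otherwise a list holding two results
range_summary = function(x){
  if (length(x) == 0) return(NULL)   # early exit: nothing below is run
  list(lo = min(x), hi = max(x))     # the last value is returned
}

res = range_summary(c(3, 1, 4))
res$lo
# [1] 1
res[["hi"]]
# [1] 4
range_summary(numeric(0))
# NULL
```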

As with most other objects, R allows function names to be reused without warning, like

min = max

Of course, this is usually a bad idea. To eliminate the custom function and revert to the base value, use rm:

rm(min)

2.4.5 Scoping

Consider a function that modifies its input:

# example function
f = function(x){
  x[1] = 999
  return(x)
}

Does modification inside the function alter the object passed to it?

z = c(1,2,3)
f(z)

# [1] 999   2   3

z

# [1] 1 2 3

We see that the modification does not carry over to the input itself. Usually, nothing created or altered inside a function does, meaning there are no side effects. There are ways to write functions with side effects in base R (<<- and assign), but these are discouraged:

fortunes::fortune("the assign function")

# 
# The only people who should use the assign function are those who fully understand why you
# should never use the assign function.
#    -- Greg Snow
#       R-help (July 2009)

The data.table package, introduced in 3.1.2, has a different design philosophy regarding the mutability of objects by functions. Its core functions almost all hinge on side effects.

2.4.6 Lazy evaluation

Now let’s see how global variables called inside a function are handled.

y = 1
g = function(x) x + y
y = 2
g(1)

# [1] 3

y = 3
g(1)

# [1] 4

So R only looks for y when it needs it and no sooner. This is called “lazy evaluation.” It has advantages, but can lead to mistakes. The next two sections (2.4.7 and 2.4.8) cover ways of managing lazy evaluation.

2.4.7 Environment

One can use with to bind a value to the function:

g2 = with(list(y = 4), function(x) x + y)
y = 1
g2(1)

# [1] 5

Now, the y the function uses is part of the “environment” of the function. Environments are essentially lists, so the value can be extracted following syntax from 2.1.8:

environment(g2)[["y"]]

# [1] 4

They can also be assigned to:

environment(g2)[["y"]] <- 10
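Putting these steps together as a self-contained sketch, we can confirm that the function picks up the new value while the global y is left alone:

```r
g2 = with(list(y = 4), function(x) x + y)
y = 1                                # a global y, ignored by g2

environment(g2)[["y"]] <- 10
g2(1)
# [1] 11
y                                    # the global y is untouched
# [1] 1
```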

2.4.8 Default arguments

The documentation shows default values for arguments. After reading ?vector, we know that vector() will return the same thing as vector(mode = "logical", length = 0).

When writing our own functions, we can also use default values. This offers another way of getting around the issue in 2.4.6:

g3 = function(x, y = 4) x + y
g3(1)

# [1] 5

g3(1, y = 10)

# [1] 11

Here, if we don’t pass y as an argument, the default value prevails. Thanks to lazy evaluation, we can even define default values in terms of other arguments:

g4 = function(x, y = x) x + y
g4(2)

# [1] 4

Of course, this will still break if the definitions are circular.

2.4.9 Finding a function by name

When I run into a function and don’t know what package it is from, besides searching online, I also sometimes use help.search or its ?? shortcut, like ??fortunes. This will search all installed packages.

2.4.10 Inspecting function source code

To inspect the R code behind a function, as with any other R object, just type its name:

nrow

# function (x) 
# dim(x)[1L]
# <bytecode: 0x0000000019dc6698>
# <environment: namespace:base>

replace

# function (x, list, values) 
# {
#     x[list] <- values
#     x
# }
# <bytecode: 0x000000001db652d8>
# <environment: namespace:base>

The last two lines of each printout show (i) that the function is byte-compiled and (ii) the environment it was defined in.

Many functions are not “exported” for direct use. Nonetheless, their code may be found using getAnywhere or accessed directly using ::: syntax:

getAnywhere(head.default)
utils:::head.default

Many such hidden functions are “methods”; see ?Methods. There are also several functions that should be hidden, documented at ?.subset, ?.getNamespace and elsewhere. These docs always have prominent notes like:

Not intended to be called directly, and only visible because of the special nature of the base namespace.

Heed the docs and don’t use any hidden or should-be-hidden functions directly.

Other functions indicate that their core functionality is written elsewhere:

max

# function (..., na.rm = FALSE)  .Primitive("max")

Often, the source code for base R functions is in C and is fairly readable even for the C-illiterate. Following Joshua Ulrich’s instructions:

Navigate to a copy of the source code.
Browse to /src/main/names.c
Read the c-entry column for the function of interest.
Search the text of /src/main/*.c for it.

2.4.11 Exercises

Write ndim, a function to count the number of dimensions an array has.
Review (2.1.11.2) and read ?cumsum, then write a cummean function to compute the cumulative mean.
Install a new package, then use library(help = "_your_new_package_") to find several objects included with the package.
Read ?pmin, then write a function that takes three vectors and computes $\min((x-y)^+, z)$ elementwise, where $(\cdot)^+$ denotes the positive part, testing on x = c(10,20,30); y = 15; z = c(25,15,5), for which the correct result is c(0, 5, 5).

2.5 Loops and control flow

As in most programming languages, we have for loops, if/else blocks, and so on. Some quirks to note:

Write for (i in idx) ... to iterate over the vector idx, even its duplicates.
Write else if, since there is no elseif, elsif or elif.
Use next to bust out of the current iteration of a loop.
Use break to bust out of the loop entirely.
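The quirks above can be seen together in one short loop. This is only a sketch; idx and seen are illustrative names:

```r
idx  = c(3, 1, 3, 7, 2)   # note the duplicated 3
seen = c()

for (i in idx) {
  if (i == 1) {
    next                  # skip the rest of this iteration
  } else if (i > 5) {
    break                 # leave the loop entirely, so 2 is never reached
  } else {
    seen = c(seen, i)     # both 3s are visited
  }
}
seen

# [1] 3 3
```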

2.5.1 Thinking about `for` loops

Suppose we have this loop to compute x[i] from idx[i]:

idx = c(1, 2, 3, 4, 5)

x   = numeric(length(idx))
for (i in idx){
  i2    = i^2
  x[i]  = 2*i2 - 1
}

There are a few problems here:

We have to refer to x multiple times, requiring extra coding if we want to reuse this code elsewhere.
The global environment is polluted by i and i2 in the end.
There is no need for an iterative approach, since each x[i] can be computed independently.

A sibling to lapply, our tool for iterating over lists from 2.4.3, can help:

idx = c(1, 2, 3, 4, 5)
x = sapply(idx, function(i){
  i2    = i^2
  return(2*i2 - 1)
})

Of course, there is still one further problem:

This task can be done with vectorized code!

idx = c(1, 2, 3, 4, 5)
i2  = idx^2
x   = 2*i2 - 1

Vectorized code and matrix algebra are usually much faster than an equivalent loop. Loops should be reserved for tasks that really must be performed iteratively. When speed becomes an issue for such tasks, it is time to learn the Rcpp family of packages, which allow for coding in C++.
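As an illustration of a task that is naturally iterative, consider a recurrence in which each value depends on the one before it. The inputs a and shock below are made up for the sketch, and a loop is only one way of writing it:

```r
a     = 0.5               # hypothetical persistence parameter
shock = c(1, 2, 3, 4)     # hypothetical inputs

x    = numeric(length(shock))
x[1] = shock[1]
for (k in 2:length(shock)) {
  # x[k] needs x[k - 1], which was computed on the previous pass
  x[k] = a * x[k - 1] + shock[k]
}
x
```

Unlike the earlier example, x[k] here cannot be computed independently of the other elements, which is why simple elementwise vectorization does not apply directly.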

Breaking an operation. When running a malfeasant loop (eating up too much memory or taking too long) in the R console, the operation can be broken by pressing ESC. This applies elsewhere in R as well, but I usually run into it with loops or one of the several functions documented at ?lapply.

2.5.2 `if`/`else` elementwise

Dummy variables. First, if tempted to do x = if (cond) 1 else 0, don’t. As mentioned in 2.3.2, the logical variable cond itself is perfect for storing this dummy variable.
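A quick sketch of why the logical vector is enough: arithmetic functions treat TRUE as 1 and FALSE as 0, so no conversion step is needed.

```r
cond = c(1, 2, 3, 4, 5) > 3   # FALSE FALSE FALSE TRUE TRUE

sum(cond)    # number of TRUE values: 2
mean(cond)   # share of TRUE values: 0.4
cond * 10    # 0 0 0 10 10
```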

The base R tool for handling elementwise if/else rules is the function ifelse. However, it has some serious shortcomings. A better way, similar to the gen/replace pattern in Stata, is covered in 3.4.4.
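For reference, here is a minimal sketch of ifelse, along with one quirk documented in ?ifelse: the result does not keep attributes such as the Date class. The Date example is only an illustration and is not necessarily the shortcoming meant above.

```r
x = c(1, 2, 3, 4, 5)

# elementwise rule: label values above 3 as "high", the rest as "low"
lab = ifelse(x > 3, "high", "low")
lab

# [1] "low"  "low"  "low"  "high" "high"

# the Date class is lost in the result
d   = as.Date(c("2020-01-01", "2020-06-01"))
out = ifelse(d > as.Date("2020-03-01"), d, as.Date(NA))
class(d)     # "Date"
class(out)   # "numeric"
```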

2.5.3 Exercises

Write a vectorized function for the root mean squared deviation between two vectors, $\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2 }$ where $n$ is the common length of the vectors. For x = 1:4; y = c(6,1,4,1) it should return 3.

2.6 Data frames

My recommendation is always to use a data.table (introduced in 3.2) instead of a data frame, so I’ve kept this section short.

Vectors and matrices are fine once we’re estimating a statistical model. However, we typically start with data consisting of many observations of multiple variables of varying types. To store data of different types, a vector or matrix won’t work; we need a list. R offers a special list structure, the data frame, for this common situation:

# example data
DF = data.frame(
  x = letters[c(1, 2, 3, 4, 5)], 
  y = c(1, 2, 3, 4, 5), 
  z = c(1, 2, 3, 4, 5) > 3
)
DF

#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Data frames are printed with observations on rows and variables on columns. In many ways, they behave like matrices: working with functions like dim, ncol, nrow, head and tail; using DF[i, j] syntax for slices; and even using DF[im] syntax for matrix indexing.

DF[3, ]

#   x y     z
# 3 c 3 FALSE

DF[, "z", drop=FALSE]

#       z
# 1 FALSE
# 2 FALSE
# 3 FALSE
# 4  TRUE
# 5  TRUE

im = matrix(c(1,1,2,2), 2, 2, byrow=TRUE) 
DF[im] # DON'T DO THIS

# [1] "a" "2"

The last item above, DF[im], is nonsensical in two ways:

First, all elements in the result are coerced to a single type.
Second, we are referring to variables by column number instead of name.

Naturally, if the data are already entirely numeric and the remaining analytical tasks are all linear algebra, a matrix will be simpler and almost always computationally faster than a data frame.

2.6.1 Iterating over columns

It is useful to keep in mind that data frames are really lists of columns, not matrices. We can use lapply (2.4.3) to iterate over such lists:

lapply(DF, function(x) x[3])

# $x
# [1] c
# Levels: a b c d e
# 
# $y
# [1] 3
# 
# $z
# [1] FALSE

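Here, we are extracting the third element from each column. The Levels line appears because x was stored as a factor; since R 4.0.0, data.frame keeps strings as character by default, so newer versions print [1] "c" instead.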
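Because a data frame is a list of columns, the same pattern works for any per-column summary. A minimal sketch that reports each column's class, with sapply used to simplify the result to a named vector; col_classes is an illustrative name:

```r
# rebuild the example data, with stringsAsFactors set explicitly so the
# result does not depend on the R version
DF = data.frame(
  x = letters[1:5],
  y = c(1, 2, 3, 4, 5),
  z = c(1, 2, 3, 4, 5) > 3,
  stringsAsFactors = FALSE
)

# apply class to every column
col_classes = sapply(DF, class)
col_classes   # x: "character", y: "numeric", z: "logical"
```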

Manual coercion. We can also get the result of that lapply call by coercion to a list: as.list(DF[3, ]). The names of similar functions can be guessed: as.matrix, as.data.frame, as.vector, as.logical, as.integer, as.numeric, as.character, as.factor, and so on. There is also the general as(x, "class_name"), but I have found it to be finicky.
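A few of these coercion functions in action, as a quick sketch:

```r
as.integer("42")              # character to integer: 42
as.character(c(1.5, 2))       # numeric to character: "1.5" "2"
as.logical(c("TRUE", "no"))   # unrecognized strings become NA: TRUE NA
as.numeric(TRUE)              # logical to numeric: 1
```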

2.6.2 Inspection

To browse a data frame in a separate window, use View(DF).

2.7 Built-in constants

R includes several handy constants, like pi, the alphabet (?letters), months (?month.name), days of the week (?weekdays) and US state names and abbreviations (?state). There is no need to worry about accidentally reusing any of these constants’ names for variables (though redefining pi might not be a great idea).
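A quick look at a few of these constants:

```r
pi                        # 3.141593
month.name[c(1, 2, 3)]    # "January" "February" "March"
month.abb[c(1, 2, 3)]     # "Jan" "Feb" "Mar"
state.abb[c(1, 2, 3)]     # "AL" "AK" "AZ"
LETTERS[c(24, 25, 26)]    # "X" "Y" "Z"
```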

Masking. Besides built-in constants, it is also possible to reuse names of most variables and functions from base R. In general, this is harmless unless it’s going to make things confusing, like a matrix named “matrix.” One common mistake is writing scripts with T and F (built-in abbreviations for TRUE and FALSE), which is a bad idea because it is easy to accidentally break such code by overwriting, for example, T as the number of periods in a panel and F as some cumulative distribution function.
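A minimal sketch of how the T shortcut can go wrong; keep and keep2 are illustrative names:

```r
keep = T      # T is an ordinary variable that starts out equal to TRUE
keep          # TRUE

T = 10        # e.g. the number of periods in a panel
keep2 = T     # the same code now silently yields 10, not TRUE
keep2         # 10

rm(T)         # remove our copy; the built-in T is visible again
isTRUE(T)     # TRUE
```

Writing TRUE and FALSE in full avoids the problem.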

On a related note, ?Reserved lists all of R’s important built-ins that cannot be overwritten.